# Extract text from PDF files with PyPDF2 and Pdfminer

## PyPDF2

In [1]:
import PyPDF2

# Open the PDF in binary mode; the with block closes the file automatically
with open('../resume/resume_skills_example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract the text of every page, separating pages with a newline
    extracted_text = "\n".join(
        page.extract_text().strip() for page in pdf_reader.pages
    )

print(extracted_text)

SKILLS  
 
● Advanced proficiency in SQL, Java, P ython and Apache Spark.  
● Machine learning and Data Science: Scikit -learn and TensorFlow.  
● Experience with data visualization and reporting tools such as Tableau  and Flask . 
● Strong analytical and problem -solving skills .


As the output shows, the `extract_text()` method of `PyPDF2` inserts spurious spaces inside words and before punctuation.

Examples: `P ython`, `Scikit -learn`, `Flask .`
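
A rough way to flag this kind of artifact without a reference text is a regular expression that looks for a space directly before a hyphen, period, or comma. This is only a heuristic sketch: the pattern and the name `find_suspect_spacing` are assumptions made for illustration, it will not catch a split inside a word such as `P ython`, and it can flag legitimate text that happens to put a space before punctuation.

```python
import re

# Heuristic: a word, one space, then a hyphen, period, or comma.
# This is an assumption about what "spurious" looks like, not a PyPDF2 feature.
SUSPECT_SPACING = re.compile(r"\w+ [-.,]\w*")


def find_suspect_spacing(text):
    """Return substrings where a single space precedes '-', '.' or ','."""
    return SUSPECT_SPACING.findall(text)


# Lines copied from the PyPDF2 output above
sample = (
    "● Machine learning and Data Science: Scikit -learn and TensorFlow.  \n"
    "● Experience with data visualization and reporting tools such as Tableau  and Flask . \n"
    "● Strong analytical and problem -solving skills ."
)

print(find_suspect_spacing(sample))
```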

To work around this problem, we can use [pdfminer](https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html), which extracts the text of this PDF without the stray spaces.

## Pdfminer

In [3]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def pdf_miner(file_path):
    """Return the text of the PDF at file_path as a single string."""
    output_string = StringIO()  # collects the extracted text
    with open(file_path, "rb") as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        # TextConverter writes the laid-out text of each page to output_string
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    resume_txt = output_string.getvalue()  # str type
    return resume_txt
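
The function above wires the parser, resource manager, converter, and interpreter together by hand. pdfminer.six also provides a convenience function, `pdfminer.high_level.extract_text`, that should give comparable output with less code. The sketch below assumes pdfminer.six is installed; the wrapper name `pdf_miner_simple` is hypothetical and not part of the original notebook.

```python
def pdf_miner_simple(file_path):
    """Extract all text from a PDF with pdfminer.six's high-level helper."""
    # Imported lazily so this definition loads even where pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Passing LAParams() explicitly mirrors the layout settings used by pdf_miner.
    return extract_text(file_path, laparams=LAParams())
```

The composable version is still useful when the converter or layout parameters need to be customized per page.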

In [5]:
# Reuse the pdf_miner helper defined above instead of repeating its body
extracted_text = pdf_miner('../resume/resume_skills_example.pdf')
print(extracted_text)

SKILLS 

●  Advanced proficiency in SQL, Java, Python and Apache Spark. 
●  Machine learning and Data Science: Scikit-learn and TensorFlow. 
●  Experience with data visualization and reporting tools such as Tableau and Flask. 
●  Strong analytical and problem-solving skills. 
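
To make the difference between the two extractions concrete, the outputs can be compared token by token. The following is a minimal sketch using the standard-library `difflib`; the helper name `token_differences` is an assumption, and the two strings are the first two bullet lines copied from the outputs above. Splitting on whitespace means differences in the amount of spacing (such as the double space after the bullet) are ignored, so only split or merged tokens are reported.

```python
import difflib

# First two bullet lines of each extraction, copied from the outputs above
pypdf2_text = (
    "● Advanced proficiency in SQL, Java, P ython and Apache Spark.\n"
    "● Machine learning and Data Science: Scikit -learn and TensorFlow."
)
pdfminer_text = (
    "●  Advanced proficiency in SQL, Java, Python and Apache Spark.\n"
    "●  Machine learning and Data Science: Scikit-learn and TensorFlow."
)


def token_differences(a, b):
    """Return (a_fragment, b_fragment) pairs where the token streams differ."""
    a_tokens, b_tokens = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a_tokens[i1:i2]), " ".join(b_tokens[j1:j2])))
    return diffs


for before, after in token_differences(pypdf2_text, pdfminer_text):
    print(f"{before!r} -> {after!r}")
```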

 


