# OCR and retrieve text from PDF

Currently, we extract patent data from the OPS API for analysis. We also want to classify PDFs directly in several cases:
- **Descriptions not available**: Some patents have no description available in the OPS API. In this case, we can use OCR to extract the text from the PDF and classify it.
- **Old patents**: To study old patents, we can use OCR to extract the text from the PDF and classify it.
- **Test your patent online**: To test your own patent online, you can extract the text from the PDF, either directly or with OCR, and classify it.
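
One way to choose between the two routes is to read the embedded text layer first and fall back to OCR only when it looks empty. The sketch below is a rough heuristic under stated assumptions: `needs_ocr` is a hypothetical helper, and the threshold of 100 characters per page is an arbitrary value to tune, not something measured in this notebook.

```python
def needs_ocr(text_layer, n_pages, min_chars_per_page=100):
    """Guess whether a PDF needs OCR, given the text read from its text layer.

    Hypothetical helper: the default threshold is an assumption to tune.
    """
    if n_pages <= 0:
        raise ValueError("n_pages must be positive")
    # Count only non-whitespace characters so blank pages do not look like text
    n_chars = sum(1 for ch in text_layer if not ch.isspace())
    return n_chars / n_pages < min_chars_per_page


print(needs_ocr("", n_pages=10))          # True
print(needs_ocr("x" * 5000, n_pages=10))  # False
```

A scanned patent typically yields little or no text from its text layer, so it would be routed to OCR, while a born-digital PDF would skip the much slower OCR step.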

In [None]:

from pdfminer.high_level import extract_text
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path, exclude_pages=()):
    """Extracts the embedded text layer of a PDF file and returns it as a string.

    Args:
        pdf_path (str): The path to the PDF file.
        exclude_pages (iterable of int): Zero-indexed page numbers to exclude from extraction.
        
    Returns:
        str: The extracted text from the PDF.
    """
    total_pages = len(PdfReader(pdf_path).pages)
    
    text = extract_text(pdf_path, page_numbers=[i for i in range(total_pages) if i not in exclude_pages])
    return text

pdf_text = extract_text_from_pdf("/home/quentin/Downloads/EP_4054160_A1.pdf", exclude_pages=[0])
print(repr(pdf_text))
     

'1\n\nEP 4 054 160 A1\n\n2 \n\nDescription\n\nCROSS-REFERENCE TO RELATED APPLICATIONS\n\n[0001] This application is a continuation-in-part appli-\ncation of and claims priority to U.S. Application Serial No.\n16/986,159, filed August 5, 2020, which is a continuation\nof U.S. Application Serial No. 16/844,783, filed April 9,\n2020, which is a continuation of U.S. Application Serial\nNo. 16/435,379, filed June 7, 2019, which is a continua-\ntion-in-part application of and claims priority to U.S. Ap-\nplication Serial No. 16/163,434, filed October 17, 2018,\nwhich  is  a  continuation  of  U.S.  Application  Serial  No.\n15/642,267, filed July 5, 2017, which claims the benefit\nof  U.S.  Application Serial  No.  62/358,996, filed  July  6,\n2016, the disclosures of which are incorporated by ref-\nerence.\n\nTECHNICAL FIELD\n\n[0002] The present disclosure relates to a mobile ac-\ncessory device, for example, one that includes an alarm\ndevice for personal protection purposes.\n\nBACKGROUN
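
The `page_numbers` argument of `extract_text` is zero-indexed, which is why `exclude_pages=[0]` drops the front page. When page numbers come from a person reading the PDF (who counts from 1), a small conversion step avoids off-by-one mistakes. This is a minimal sketch; `pages_to_extract` and its parameter names are hypothetical.

```python
def pages_to_extract(total_pages, exclude_pages_1based=()):
    """Return zero-indexed page numbers, skipping the given 1-based pages."""
    excluded = {page - 1 for page in exclude_pages_1based}
    return [i for i in range(total_pages) if i not in excluded]


# Skip the first and fourth pages of a five-page document
print(pages_to_extract(5, exclude_pages_1based=[1, 4]))  # [1, 2, 4]
```

The returned list can be passed directly as `page_numbers`.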

In [4]:
from pdf2image import convert_from_path
import easyocr
import numpy as np
from tqdm.notebook import tqdm

def extract_text_from_pdf(pdf_path, exclude_pages=()):
    """Extracts text from a PDF file with OCR and returns it as a string.

    Args:
        pdf_path (str): The path to the PDF file.
        exclude_pages (iterable of int): Zero-indexed page numbers to exclude from extraction.
        
    Returns:
        str: The extracted text from the PDF.
    """
    # Convert PDF to images
    images = convert_from_path(pdf_path)
    
    # Init EasyOCR reader
    reader = easyocr.Reader(['en', 'fr', 'de'], gpu=False)
    
    text = ""
    for i, image in enumerate(tqdm(images)):
        if i in exclude_pages:
            continue
        # Convert PIL image to NumPy array
        image_np = np.array(image)
        # Perform OCR on the image
        result = reader.readtext(image_np)
        # Extract text from the result
        for detection in result:
            text += detection[1] + "\n"

    return text

pdf_text = extract_text_from_pdf("/home/quentin/Downloads/EP_4054160_A1.pdf", exclude_pages=[0])
print(repr(pdf_text))

Using CPU. Note: This module is much faster with a GPU.


  0%|          | 0/72 [00:00<?, ?it/s]



KeyboardInterrupt: 
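
The run above was interrupted: OCR on 72 pages with a CPU is slow, and every page is rendered to an image up front, including the excluded ones. One possible mitigation is to render and recognize one page at a time. The sketch below shows only the loop; `ocr_selected_pages`, `render_page` and `recognize` are hypothetical names, and the renderer and OCR engine are injected as callables so the logic can be tried without any OCR library.

```python
def ocr_selected_pages(n_pages, render_page, recognize, exclude_pages=()):
    """OCR a document page by page, rendering only the pages that are kept.

    render_page(i) returns the image of zero-indexed page i, and
    recognize(image) returns a list of text lines. Both are assumptions
    of this sketch, supplied by the caller.
    """
    excluded = set(exclude_pages)
    lines = []
    for i in range(n_pages):
        if i in excluded:
            continue  # excluded pages are never rendered
        image = render_page(i)
        lines.extend(recognize(image))
    return "\n".join(lines)


# Stand-ins for the real renderer and OCR engine
fake_pages = ["front page", "Description", "Claims"]
text = ocr_selected_pages(
    n_pages=len(fake_pages),
    render_page=lambda i: fake_pages[i],
    recognize=lambda image: [image.upper()],
    exclude_pages=[0],
)
print(repr(text))  # 'DESCRIPTION\nCLAIMS'
```

With the libraries used in this notebook, `render_page` could wrap `convert_from_path(pdf_path, first_page=i + 1, last_page=i + 1)[0]` (pdf2image counts pages from 1), and `recognize` could wrap `reader.readtext(np.array(image), detail=0)`, which returns plain strings. Only one page image is held in memory at a time, and partial results survive if the run is stopped early.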

In [10]:
def filter_numbers(text):
    """Filters out digit-only lines from the given text (line numbers, page numbers, column numbers).
    
    Args:
        text (str): The input text.
        
    Returns:
        str: The text with numbers filtered out.
    """
    # Split the text into lines
    lines = text.splitlines()
    
    # Filter out lines that contain only numbers
    filtered_lines = [line for line in lines if not line.strip().isdigit()]
    
    # Join the filtered lines back into a single string
    return "\n".join(filtered_lines)

filtered_text = filter_numbers(pdf_text)
print(repr(filtered_text))

'\nEP 4 054 160 A1\n\n\nDescription\n\nCROSS-REFERENCE TO RELATED APPLICATIONS\n\n[0001] This application is a continuation-in-part appli-\ncation of and claims priority to U.S. Application Serial No.\n16/986,159, filed August 5, 2020, which is a continuation\nof U.S. Application Serial No. 16/844,783, filed April 9,\n2020, which is a continuation of U.S. Application Serial\nNo. 16/435,379, filed June 7, 2019, which is a continua-\ntion-in-part application of and claims priority to U.S. Ap-\nplication Serial No. 16/163,434, filed October 17, 2018,\nwhich  is  a  continuation  of  U.S.  Application  Serial  No.\n15/642,267, filed July 5, 2017, which claims the benefit\nof  U.S.  Application Serial  No.  62/358,996, filed  July  6,\n2016, the disclosures of which are incorporated by ref-\nerence.\n\nTECHNICAL FIELD\n\n[0002] The present disclosure relates to a mobile ac-\ncessory device, for example, one that includes an alarm\ndevice for personal protection purposes.\n\nBACKGROUND\n\n[0
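
The filtered text still contains the running header `EP 4 054 160 A1`, which repeats on each page. A possible extension drops lines that match a header pattern in addition to digit-only lines. This is a sketch only: `filter_layout_lines` is a hypothetical name, and the regular expression is an assumption inferred from this single example, so other publication number formats would need their own pattern.

```python
import re

# Assumed header shape, inferred from the single example "EP 4 054 160 A1"
EP_HEADER = re.compile(r"^EP \d \d{3} \d{3} [AB]\d$")


def filter_layout_lines(text, header_pattern=EP_HEADER):
    """Drop digit-only lines and lines matching the running header pattern."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.isdigit() or header_pattern.match(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "1\n\nEP 4 054 160 A1\n\n2 \n\nDescription\n\n[0001] This application"
print(repr(filter_layout_lines(sample)))
# '\n\n\nDescription\n\n[0001] This application'
```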

In [9]:
def split_into_chunks(text):
    """Splits the text into chunks at blank lines (paragraph breaks).
    
    Args:
        text (str): The input text.
        
    Returns:
        list: A list of text chunks.
    """
    # Split the text into chunks at blank lines
    chunks = text.split("\n\n")
    
    # Rejoin words hyphenated across a line break, then turn the remaining
    # line breaks into spaces so that adjacent words are not fused together
    chunks = [chunk.replace("-\n", "") for chunk in chunks]
    chunks = [chunk.replace("\n", " ").strip() for chunk in chunks]
    
    # Remove empty strings from the list
    chunks = [chunk for chunk in chunks if chunk]
    
    return chunks

chunks = split_into_chunks(filtered_text)
print(chunks)

['EP 4 054 160 A1', 'Description', 'CROSS-REFERENCE TO RELATED APPLICATIONS', '[0001] This application is a continuation-in-part application of and claims priority to U.S. Application Serial No. 16/986,159, filed August 5, 2020, which is a continuation of U.S. Application Serial No. 16/844,783, filed April 9, 2020, which is a continuation of U.S. Application Serial No. 16/435,379, filed June 7, 2019, which is a continuation-in-part application of and claims priority to U.S. Application Serial No. 16/163,434, filed October 17, 2018, which  is  a  continuation  of  U.S.  Application  Serial  No. 15/642,267, filed July 5, 2017, which claims the benefit of  U.S.  Application Serial  No.  62/358,996, filed  July  6, 2016, the disclosures of which are incorporated by reference.', 'TECHNICAL FIELD', '[0002] The present disclosure relates to a mobile accessory device, for example, one that includes an alarm device for personal protection purposes.', 'BACKGROUND', '[0003] Personal safety remains a chall
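
The chunk list alternates between all-caps section headings and paragraphs numbered like `[0001]`. For classification it may help to group paragraphs under their heading. The sketch below assumes exactly that layout; `group_by_heading` and both regular expressions are hypothetical and derived only from the output shown above.

```python
import re

# Assumed shapes: "[0001] text" paragraphs and headings in capitals only
PARAGRAPH = re.compile(r"^\[(\d{4})\]\s*(.*)$", re.DOTALL)
HEADING = re.compile(r"^[A-Z][A-Z \-]+$")


def group_by_heading(chunks):
    """Group numbered paragraphs under the most recent all-caps heading."""
    sections = {}
    current = "UNTITLED"  # bucket for paragraphs seen before any heading
    for chunk in chunks:
        match = PARAGRAPH.match(chunk)
        if match:
            number, body = match.group(1), match.group(2)
            sections.setdefault(current, []).append((number, body))
        elif HEADING.match(chunk):
            current = chunk
            sections.setdefault(current, [])
        # Anything else (running header, "Description") is ignored


    return sections


sample_chunks = [
    "EP 4 054 160 A1",
    "Description",
    "CROSS-REFERENCE TO RELATED APPLICATIONS",
    "[0001] This application is a continuation-in-part application",
    "TECHNICAL FIELD",
    "[0002] The present disclosure relates to a mobile accessory device",
    "BACKGROUND",
]
for heading, paragraphs in group_by_heading(sample_chunks).items():
    print(heading, [number for number, _ in paragraphs])
```

Chunks that are neither headings nor numbered paragraphs are dropped here, which is a design choice of the sketch; claims and figure captions would need their own rules.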