## **Notebook 3: Document Parsers - Extracting Text from PDFs**
## **Introduction:**
Document parsers extract content from various types of documents, such as PDFs, Word files, or images. In this notebook, we’ll explore the FitzPdfParser module, whose PDFtoTextParser class extracts text from PDF files. This is useful when analyzing or processing the content of reports, papers, and other PDFs that contain a text layer. Image-only scans generally need OCR before any text can be extracted.

**Import dependencies**

In [2]:
from swarmauri_community.parsers.concrete.FitzPdfParser import PDFtoTextParser as Parser

**Instantiate the parser**

In [3]:
parser = Parser()

**Specify the path to the PDF file**

In [9]:
file_path = "main.pdf"
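
Before handing the path to the parser, it can help to confirm that it points to an existing PDF. The sketch below uses only the standard library; `check_pdf_path` is a hypothetical helper written for this notebook, not part of Swarmauri.

```python
from pathlib import Path


def check_pdf_path(file_path: str) -> Path:
    """Return file_path as a Path if it names an existing .pdf file."""
    path = Path(file_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path


# Report a readable message instead of failing later inside the parser.
try:
    check_pdf_path("main.pdf")
except (ValueError, FileNotFoundError) as err:
    print(f"Cannot parse: {err}")
```

Checking the extension first keeps the error message specific when the wrong file type is passed by mistake.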

**Parse the PDF to extract text**

In [10]:
documents = parser.parse(file_path)

**Display the extracted text from the PDF**

In [11]:
for document in documents:
    print(f"Extracted Content: {document.content}")
    print(f"Source: {document.metadata['source']}")

Extracted Content: Engineering Applications of Artificial Intelligence 126 (2023) 107021
Available online 4 September 2023
0952-1976/© 2023 Elsevier Ltd. All rights reserved.
Contents lists available at ScienceDirect
Engineering Applications of Artificial Intelligence
journal homepage: www.elsevier.com/locate/engappai
Survey paper
Transformer for object detection: Review and benchmark✩
Yong Li a,∗, Naipeng Miao a, Liangdi Ma b, Feng Shuang a, Xingwen Huang a
a Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, No. 100, Daxuedong
Road, Xixiangtang District, Nanning, 530004, Guangxi, China
b School of Software, Tsinghua University, No. 30 Shuangqing Road, Haidian District, Beijing, 100084, China
A R T I C L E
I N F O
Keywords:
Review
Object detection
Transformer-based models
COCO2017 dataset
Benchmark
A B S T R A C T
Object detection is a crucial task in computer vision (CV). With the rapid advancement o
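
Printing the full content of a long PDF floods the notebook, which is why the output above is cut short. A small helper can produce a compact preview instead. This is a minimal sketch; `preview` and its `max_chars` parameter are hypothetical names, and the sample string is taken from the output above.

```python
def preview(text: str, max_chars: int = 300) -> str:
    """Collapse whitespace runs and cut the text to at most max_chars characters."""
    flat = " ".join(text.split())
    if len(flat) <= max_chars:
        return flat
    # Mark the cut so the reader knows the text continues.
    return flat[:max_chars].rstrip() + " [...]"


sample = "Survey paper\nTransformer for object detection:   Review and benchmark"
print(preview(sample, max_chars=40))
```

In the display loop, `print(preview(document.content))` would replace the full print of `document.content`.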

## **Explanation:**

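Parser Initialization: We import the PDFtoTextParser class from the FitzPdfParser module (aliased as Parser) and instantiate it.

Parsing PDF: The .parse() method takes the path to a PDF file and returns a list of documents containing the extracted text.

Result: For each document, we print the extracted text (content) and the source PDF file path (metadata['source']).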
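
To make the parsing step less of a black box, here is a rough sketch of how text extraction with PyMuPDF (the `fitz` package the parser is named after) might look. It is an assumption about the general approach, not Swarmauri's actual implementation; `extract_pdf_text` is a hypothetical function, and it returns plain dictionaries rather than Swarmauri document objects.

```python
from pathlib import Path


def extract_pdf_text(file_path: str) -> list[dict]:
    """Return a one-element list holding the PDF's text and its source path."""
    if not Path(file_path).is_file():
        raise FileNotFoundError(f"No such file: {file_path}")

    import fitz  # PyMuPDF; imported here so the path check works without it

    # Concatenate the text layer of every page, in page order.
    with fitz.open(file_path) as pdf:
        text = "".join(page.get_text() for page in pdf)

    return [{"content": text, "metadata": {"source": file_path}}]
```

Calling `extract_pdf_text("main.pdf")` requires PyMuPDF to be installed (`pip install pymupdf`) and yields the same two pieces of information the notebook prints: the text and its source path.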

## **Conclusion:**
Document parsers like the FitzPdfParser make it easy to extract text from PDF documents. They are especially valuable when a large volume of documents requires automated text extraction.
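
For the large-volume case, the same parse call can be applied to every PDF in a folder. The sketch below is one possible way to do that; `parse_folder` is a hypothetical helper, and it accepts any callable that takes a file path and returns an iterable of documents, so the parser's `parse` method can be passed in directly.

```python
from pathlib import Path
from typing import Callable, Iterable


def parse_folder(folder: str, parse_fn: Callable[[str], Iterable]) -> dict[str, list]:
    """Apply parse_fn to every .pdf file in folder, keyed by file path."""
    results: dict[str, list] = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[str(pdf_path)] = list(parse_fn(str(pdf_path)))
        except Exception as err:  # keep going if a single file fails
            print(f"Skipped {pdf_path.name}: {err}")
    return results


# Usage with the parser from this notebook (assumes a local "reports" folder):
# all_documents = parse_folder("reports", parser.parse)
```

Catching errors per file means one corrupt PDF does not stop the rest of the batch.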



## **NOTEBOOK METADATA**

In [4]:
import os
import platform
import sys
from datetime import datetime

# Display author information
author_name = "Dominion John"
github_username = "DOMINION-JOHN1"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

# Last modified datetime (file's metadata)
notebook_file = "Notebook _3 _Document_parsers.ipynb" 
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except OSError as e:  # raised when the file is missing or inaccessible
    print(f"Could not retrieve last modified datetime: {e}")

# Display platform, Python version, and Swarmauri version
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

# Checking Swarmauri version
try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")



Author: Dominion John 
GitHub Username: DOMINION-JOHN1
Last Modified: 2024-10-21 15:12:19.640951
Platform: Windows 11
Python Version: 3.12.7 (tags/v3.12.7:0b05ead, Oct  1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
Swarmauri Version: 0.5.0
