# Data Visualization

### Total number of PDF files in the working directory

In [2]:
import os
from pathlib import Path


# Let's create a function that finds .pdf files and collects their paths in pdf_files
def get_all_pdfs(root_dir):
    pdf_files = []
    for dirpath, _, filenames in os.walk(root_dir):
        for f in filenames:
            if f.lower().endswith(".pdf"):
                pdf_files.append(os.path.join(dirpath, f))
    return pdf_files


# Set the target directory
target_dir = r"F://automate-accounts//automate-accouts-software-//automate-accounts-developer-hiring-assessment"
pdfs = get_all_pdfs(target_dir)
print(f"Found Total : {len(pdfs)} PDFs in the working directory")

Found Total : 181 PDFs in the working directory
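
The cell above imports `Path` but walks the tree with `os.walk`. As a minimal sketch of the same search written with `pathlib`, the function below could be used instead; the name `find_pdfs` is hypothetical and does not appear elsewhere in this notebook.

```python
from pathlib import Path


def find_pdfs(root_dir):
    """Return every file under root_dir whose suffix is .pdf, ignoring case."""
    root = Path(root_dir)
    # rglob("*") visits the whole tree; comparing the lower-cased suffix keeps
    # the match case-insensitive, so "report.PDF" is found as well.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `len(find_pdfs(target_dir))` should give the same count as `len(pdfs)` for ordinary file names.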


## Technical Background: How pdfplumber Works

## Overview
`pdfplumber` is a Python library designed for extracting text, tables, and metadata from PDF files. It provides fine-grained access to the content and layout of each page, making it especially useful for structured data extraction and document analysis.

## How pdfplumber Works
- **PDF Parsing:** pdfplumber is built on top of the `pdfminer.six` library, which parses the raw PDF file format. PDF files are complex binary documents that store text, images, vector graphics, and layout instructions. pdfplumber leverages pdfminer to decode these elements into Python objects.
- **Page Objects:** When you open a PDF with pdfplumber, each page is represented as a `Page` object. This object contains methods to extract text, tables, images, and geometric information.
- **Text Extraction:** pdfplumber can extract text in two ways:
    - **Raw Text:** Using `page.extract_text()`, it reconstructs the text by analyzing the position and order of characters and words on the page.
    - **Character-Level Data:** With `page.chars`, you can access the position, font, and other metadata for every character.
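- **Table Extraction:** By default pdfplumber locates tables from the ruling lines drawn on the page; a text-alignment strategy can be selected through `table_settings` when a table has no visible lines. The `page.extract_tables()` method returns a list of tables, where each table is a list of rows and each row is a list of cell strings (or `None` for an empty cell). A single table can be converted to a pandas DataFrame.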
- **Layout Analysis:** pdfplumber provides access to the geometric layout of each page, including bounding boxes for text, lines, rectangles, and images. This allows for custom extraction and visualization.
- **Image Extraction:** `page.images` lists the raster images embedded on the page together with their bounding boxes, and `page.to_image()` renders the whole page to a bitmap for inspection.
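
To make the idea of rebuilding text from character positions concrete, here is a simplified sketch. It is not pdfplumber's actual algorithm: it only groups character dicts by their `top` coordinate and orders them by `x0`, and it does not insert spaces between words. The function name `chars_to_lines` and the `y_tolerance` default are assumptions made for this illustration; the keys `text`, `x0`, and `top` are the ones found on entries of `page.chars`.

```python
def chars_to_lines(chars, y_tolerance=3):
    """Group character dicts into text lines using their vertical position."""
    lines = []  # each entry is a list of characters that share one text line
    for ch in sorted(chars, key=lambda c: c["top"]):
        # A character joins the current line when its top edge is close to
        # the top edge of the first character already on that line.
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, read the characters from left to right.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]
```

With an open page, `chars_to_lines(page.chars)` gives a rough view of how the characters cluster into lines, which can help when `page.extract_text()` returns text in an unexpected order.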

## Features
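- **Precise Text Extraction:** Exposes the position, font, and size of every character, which gives you the information needed to deal with multi-column layouts, rotated text, and non-standard fonts.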
- **Table Detection:** Identifies tables using lines, whitespace, and text alignment.
- **Visual Debugging:** The `page.to_image()` method lets you overlay extracted elements on the page image for debugging and validation.
- **Metadata Access:** Extracts document metadata, page dimensions, and more.
- **Integration:** Extracted tables are plain Python lists, so they load directly into pandas for analysis, and the extracted values can be plotted with matplotlib.

## Typical Workflow
1. **Open PDF:** `with pdfplumber.open('file.pdf') as pdf:`
2. **Iterate Pages:** `for page in pdf.pages:`
3. **Extract Text:** `text = page.extract_text()`
4. **Extract Tables:** `tables = page.extract_tables()`
5. **Visualize:** `img = page.to_image().draw_rects(page.extract_words())`
6. **Convert to DataFrame:** `df = pd.DataFrame(table[1:], columns=table[0])` (for each `table` in `tables`, treating its first row as the header)
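
The steps above can be combined into one function. The following is a sketch under a few assumptions: the names `table_to_records` and `extract_pdf` are hypothetical, the first row of every table is assumed to be its header, and tables are returned as lists of dicts rather than DataFrames so that the helper runs without pandas. `pdfplumber` is imported inside `extract_pdf` so the helper can be tried on its own.

```python
def table_to_records(table):
    """Convert one extracted table (a list of rows) into a list of dicts."""
    if not table:
        return []
    # Assumption: row 0 holds the column names. Empty header cells are None.
    header = ["" if h is None else str(h).strip() for h in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]


def extract_pdf(pdf_path):
    """Collect the text and tables of every page in pdf_path."""
    import pdfplumber  # imported here so table_to_records works without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": number,
                # extract_text() may return nothing for pages without text
                "text": page.extract_text() or "",
                "tables": [table_to_records(t) for t in page.extract_tables()],
            })
    return pages
```

Each list of records can be passed to `pd.DataFrame(records)` if a DataFrame is preferred. Tables with repeated column names would lose columns in this form, so the `pd.DataFrame(table[1:], columns=table[0])` form from step 6 is safer in that case.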

## Limitations
- **Scanned PDFs:** pdfplumber cannot extract text from scanned images (use OCR libraries like pytesseract for those).
- **Complex Layouts:** Highly complex or irregular layouts may require custom extraction logic.
- **PDF Variability:** Not all PDFs are created equal; extraction quality depends on how the PDF was generated.
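
Since scanned pages yield no text, it can be useful to flag them before extraction. The sketch below is a heuristic and not a pdfplumber feature: it treats a page as probably scanned when it has almost no characters but at least one embedded image. The names `page_looks_scanned` and `scanned_pages` and the `min_chars` threshold are assumptions chosen for illustration.

```python
def page_looks_scanned(n_chars, n_images, min_chars=10):
    """Heuristic: very few characters plus at least one image suggests a scan."""
    return n_chars < min_chars and n_images > 0


def scanned_pages(pdf_path, min_chars=10):
    """Return the 1-based numbers of pages in pdf_path that look scanned."""
    import pdfplumber  # imported here so the heuristic can be tested alone

    flagged = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # page.chars and page.images are lists of dicts, one per object
            if page_looks_scanned(len(page.chars), len(page.images), min_chars):
                flagged.append(number)
    return flagged
```

Pages flagged this way are candidates for OCR; pages with neither text nor images are simply blank and are not flagged.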

## Summary
pdfplumber is a powerful tool for extracting structured and unstructured data from PDF files, with deep access to layout and content. It is ideal for data science, document analysis, and automation tasks involving PDFs.

In [None]:
import glob
import os
import random

import matplotlib.pyplot as plt
import pandas as pd
import pdfplumber
import spacy
from PIL import Image
from PIL.ImageDraw import ImageDraw


### Method 1: pdfplumber (text extraction)


In [7]:
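def extract_text_pdfplumber(pdf_path):
    """Return the text of every page in pdf_path, or "" if the file cannot be read."""
    page_texts = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                # Skip pages without extractable text (for example scanned pages)
                if page_text:
                    page_texts.append(page_text + "\n")
    except Exception as e:
        # Report the failure and discard any partial result
        print(f"Error extracting from {pdf_path}: {e}")
        return ""
    return "".join(page_texts)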
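
A possible next step is to run the extractor over every PDF found earlier and note which files produced no text. The helper below is a sketch; `extract_many` is a hypothetical name, and it accepts any function that maps a path to a string so that it can be tried without real PDFs.

```python
def extract_many(pdf_paths, extractor):
    """Apply extractor to each path; return (texts, empty_paths)."""
    texts = {}   # path -> extracted text
    empty = []   # paths whose extraction produced only whitespace
    for path in pdf_paths:
        text = extractor(path)
        texts[path] = text
        if not text.strip():
            empty.append(path)
    return texts, empty
```

In this notebook the call would be `texts, empty = extract_many(pdfs, extract_text_pdfplumber)`. Files listed in `empty` either failed to open or contain no text layer.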