# Image Data Extraction with OCR (Pandas)

## Project Overview
This notebook demonstrates how to extract tabular data from images using OCR (Optical Character Recognition) with `pytesseract` and process it with `pandas`. The workflow includes installing dependencies, loading the image, performing OCR, and converting the extracted text into a structured DataFrame.

## Tech Stack
- Python 3.12
- `pytesseract` (Python wrapper for Tesseract OCR)
- `Pillow` for image handling
- `matplotlib` for previewing the loaded image
- `pandas` for data manipulation
- Optional: `easyocr` for more complex table layouts

---


In [None]:
# Install required Python packages (run once)
!pip install -q pytesseract pillow pandas matplotlib  # quiet install; matplotlib is used for the image preview

# Optional: install EasyOCR for advanced table extraction
# !pip install -q easyocr


## Install Tesseract OCR Engine
`pytesseract` is only a Python wrapper; you need the underlying Tesseract OCR binary installed on your system.

**macOS (Homebrew)**
```bash
brew install tesseract
tesseract --version  # verify installation
```

**Linux (apt)**
```bash
sudo apt-get update && sudo apt-get install -y tesseract-ocr
tesseract --version
```

**Windows**
1. Download the installer from https://github.com/tesseract-ocr/tesseract/releases
2. Add the installation directory (e.g., `C:\Program Files\Tesseract-OCR`) to your system `PATH`.

After installation, restart your terminal/Jupyter kernel and verify with `tesseract --version`.
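
The same check can be run from inside the notebook. The following is a minimal sketch using the standard library's `shutil.which`; the helper name `find_tesseract` is hypothetical and not part of `pytesseract`. The commented-out lines show the `tesseract_cmd` attribute that `pytesseract` reads to locate the binary, using the example Windows directory from above as an assumed install location.

```python
import shutil

def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract binary not found on PATH.")
    # If the binary is installed outside PATH (common on Windows),
    # pytesseract can be pointed at it directly (assumed install location):
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
else:
    print(f"Tesseract found at {tesseract_path}")
```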

---


In [None]:
from PIL import Image
import matplotlib.pyplot as plt
import os

# 👉 Replace the placeholder with the actual path to your image file
# Best practice: Use an environment variable or a relative path
image_path = os.getenv("IMAGE_PATH", "data/sample_table.jpg")

if os.path.exists(image_path):
    img = Image.open(image_path)
    plt.imshow(img)
    plt.axis('off')
    plt.show()
else:
    print(f"Image not found at {image_path}. Please check the path.")

In [None]:
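import pytesseract

if os.path.exists(image_path):
    # Perform OCR on the image loaded in the previous cell
    try:
        text = pytesseract.image_to_string(img)
        print("--- Extracted Text ---")
        print(text)
    except pytesseract.TesseractNotFoundError:
        # Raised when the Tesseract binary itself cannot be located
        print("Tesseract not found. Ensure it is installed and in your PATH.")
    except pytesseract.TesseractError as e:
        # Raised when Tesseract runs but reports a failure
        print(f"OCR Error: {e}")
else:
    print(f"Image not found at {image_path}. Skipping OCR.")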
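
One way to turn the extracted text into a DataFrame is sketched below. It assumes the column gaps survive OCR as tabs or runs of two or more spaces, which may require passing something like `config="--psm 6 -c preserve_interword_spaces=1"` to `image_to_string`. The helper `parse_ocr_table` and the sample text are hypothetical stand-ins for the `text` variable above, and real output will usually need more cleanup.

```python
import re

def parse_ocr_table(raw_text):
    """Split OCR text into (header, rows), treating tabs or runs of 2+ spaces as column gaps."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    if not lines:
        return [], []
    cells = [re.split(r"\s{2,}|\t", ln) for ln in lines]
    header, body = cells[0], cells[1:]
    width = len(header)
    # Pad short rows and truncate long ones so every row matches the header width
    rows = [(row + [""] * width)[:width] for row in body]
    return header, rows

# Hypothetical OCR output, standing in for the `text` variable from the cell above
sample_text = """
Item      Qty   Price
Widget    4     3.50
Gadget    10    12.00
"""

header, rows = parse_ocr_table(sample_text)

try:
    import pandas as pd
    df = pd.DataFrame(rows, columns=header)
    # OCR yields strings, so convert the numeric columns explicitly
    df[["Qty", "Price"]] = df[["Qty", "Price"]].apply(pd.to_numeric, errors="coerce")
    print(df)
except ImportError:
    # The guard only keeps this sketch runnable where pandas is missing
    print(header, rows)
```

Splitting on wide gaps rather than single spaces keeps multi-word cells such as `Blue Widget` in one column, at the cost of failing when OCR collapses the spacing.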

## Next Steps (Optional)
- Clean the raw OCR output (remove extra whitespace, line breaks).
- Use regular expressions or `pandas.read_fwf` to parse fixed-width columns.
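- For more complex tables, try `easyocr`, or run `camelot` on a text-based (searchable) PDF. Camelot does not read scanned images, so a plain image-to-PDF conversion needs an OCR text layer first.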
- Store the resulting DataFrame to CSV/Excel for downstream analysis.
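
The first cleaning step might look like the sketch below. It assumes the raw output may contain Windows line endings, blank lines, and the form feed that Tesseract often appends at the end of a page; the helper name `clean_ocr_text` is hypothetical. Internal spacing is left untouched so that fixed-width column positions are preserved.

```python
def clean_ocr_text(raw_text):
    """Normalise line endings, drop form feeds and blank lines, and trim trailing spaces."""
    text = raw_text.replace("\r\n", "\n").replace("\f", "")
    # Strip only trailing whitespace so leading and internal column spacing survives
    lines = [ln.rstrip() for ln in text.split("\n")]
    return "\n".join(ln for ln in lines if ln.strip())

# Hypothetical raw OCR output
raw = "Item   Qty  \r\n\r\nWidget 4\n\n\x0c"
print(repr(clean_ocr_text(raw)))
```

The cleaned string can then be wrapped in `io.StringIO` and passed to `pandas.read_fwf`, and the resulting DataFrame written out with `DataFrame.to_csv`.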

---
*All code cells are intentionally minimal for clarity. Adjust paths and parameters to fit your specific use case.*