Extracts text from PDF and PowerPoint documents and summarizes it into key points and bullet lists, using LangChain and Google's Gemini.

isaiah76/Reviewer

Reviewer

Reviewer is a document processing tool that extracts text from PDF and PowerPoint documents (.pdf, .pptx, .ppt) and summarizes it using LangChain and Google's Gemini AI. It also supports OCR-based extraction for image-based PDFs.

Features

  • Extract text from PDF and PPT/PPTX files.
  • Uses Tesseract OCR to extract text from images within PDFs and PowerPoint slides.
  • Summarizes the document into a clear overview.
  • Answers user questions based on the document.
  • Maintains conversation context and memory.
  • Uses LangChain and Google's Gemini API.
  • Supports large documents by chunking the document before processing.
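
The chunking step mentioned in the last feature can be illustrated with a minimal sketch. This is not the code in core/text_chunker.py; the function name chunk_text, the character-based sizes, and the default values are assumptions chosen for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that some context
    around each boundary is carried into the next chunk.
    (Hypothetical helper; sizes are illustrative, not the project's values.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    chunks: List[str] = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk can then be summarized separately and the partial summaries combined, which keeps every request within the model's input limit.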

Project Structure

Reviewer/
├── main.py                    # Entry point
├── requirements.txt           # Dependencies
├── .env                       # Environment file
├── config/
│   └── settings.py            # Configuration settings
├── core/
│   ├── __init__.py
│   ├── document_processor.py  # Document text extraction
│   ├── text_chunker.py        # Text chunking utilities
│   ├── ai_service.py          # LLM model integration
│   └── cli.py                 # User interface functions
└── utils/
    ├── __init__.py
    ├── file_helpers.py        # File validation utilities
    └── converters.py          # Conversion utilities

Requirements

  • Python 3.8 or higher
  • Tesseract OCR
  • Poppler
  • LibreOffice (Optional)

For better performance, it is recommended to install unoconv or unoserver alongside LibreOffice.

Installation

Clone the repository:

git clone https://github.com/isaiah76/Reviewer.git
cd Reviewer

Linux Installation

Run the provided installation script:

chmod +x install.sh && ./install.sh

Windows Installation

Run the provided batch script:

install.bat

Dependencies

Ensure the following dependencies are installed before running the program:

Tesseract OCR

Linux:

  • Debian/Ubuntu:
    sudo apt install tesseract-ocr
  • Arch Linux:
    sudo pacman -S tesseract 
  • Fedora:
    sudo dnf install tesseract

macOS:

brew install tesseract

Windows: Download and install from Tesseract OCR. Ensure the installation path is added to your system PATH.

Poppler

Linux:

  • Debian/Ubuntu:
    sudo apt install poppler-utils
  • Arch Linux:
    sudo pacman -S poppler
  • Fedora:
    sudo dnf install poppler-utils

macOS:

brew install poppler

Windows: Download from Poppler for Windows and add it to the system PATH.
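
To confirm that both tools are reachable after installation, a quick check along these lines may help. It is a sketch, not part of the project: the helper name find_tools is hypothetical, and it assumes the executables are named tesseract and pdftoppm (the Poppler utility that pdf2image relies on).

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names: "tesseract" (Tesseract OCR), "pdftoppm" (Poppler).
    for name, path in find_tools(["tesseract", "pdftoppm"]).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, check that its installation directory is on your PATH, or set TESSERACT_CMD as described under Environment Variables.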

LibreOffice or unoconv/unoserver (Optional tools for .ppt to .pptx conversion):

Linux:

  • Debian/Ubuntu:
    sudo apt install libreoffice unoconv
  • Arch Linux:
    sudo pacman -S libreoffice-fresh
  • Fedora:
    sudo dnf install libreoffice unoconv

macOS:

brew install libreoffice

For unoserver:

pip install unoserver
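
For reference, a .ppt to .pptx conversion with LibreOffice in headless mode can be sketched as follows. This is an illustration under assumptions, not the code in utils/converters.py: the function names are hypothetical, and the sketch assumes the LibreOffice executable is available as soffice or libreoffice.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(ppt_path: str, out_dir: str, soffice: str = "soffice") -> List[str]:
    """Build a headless LibreOffice command that converts a .ppt file to .pptx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pptx",
        "--outdir", out_dir,
        ppt_path,
    ]


def expected_output_path(ppt_path: str, out_dir: str) -> Path:
    """LibreOffice keeps the input's base name and swaps the extension."""
    return Path(out_dir) / (Path(ppt_path).stem + ".pptx")


def convert_ppt_to_pptx(ppt_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path of the converted file."""
    # Assumed executable names; adjust if your installation differs.
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise RuntimeError("LibreOffice was not found on PATH")
    subprocess.run(build_convert_command(ppt_path, out_dir, soffice), check=True)
    return expected_output_path(ppt_path, out_dir)
```

Starting LibreOffice for every file is slow, which is one reason a persistent converter such as unoserver can perform better.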

If the installation script didn't process the packages correctly, you can install them directly:

pip install -r requirements.txt

Or, if requirements.txt is missing, install the required packages manually:

pip install python-dotenv langchain langchain-google-genai google-generativeai PyPDF2 python-pptx pyfiglet pytesseract pdf2image Pillow

Environment Variables

Before running the program, create a .env file in the project root (or use the provided .env.example as a guide).

Google Gemini API Key

  1. Visit the Google AI Studio website
  2. Sign in with your Google account
  3. Navigate to "Get API key" or go to your profile settings
  4. Create a new API key or use an existing one
  5. Copy the API key into your .env file

Include your Gemini API key:

GEMINI_API_KEY=your_api_key_here

Optionally, set the Tesseract command if it is not in your PATH:

TESSERACT_CMD=/path/to/tesseract  # Linux/macOS
TESSERACT_CMD=C:\\Program Files\\Tesseract-OCR\\tesseract.exe  # Windows
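
As a rough illustration of how these two variables might be consumed, the sketch below reads them from the environment and fails early if the API key is missing. It is not the content of config/settings.py; the Settings type and load_settings function are hypothetical, and the sketch assumes the .env file has already been loaded into the environment (for example with python-dotenv's load_dotenv()).

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    gemini_api_key: str
    tesseract_cmd: Optional[str]  # None means "rely on PATH"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read GEMINI_API_KEY (required) and TESSERACT_CMD (optional) from env."""
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    # Treat an empty or missing TESSERACT_CMD as "not configured".
    tesseract_cmd = env.get("TESSERACT_CMD", "").strip() or None
    return Settings(gemini_api_key=api_key, tesseract_cmd=tesseract_cmd)
```

When tesseract_cmd is set, it would typically be assigned to pytesseract.pytesseract.tesseract_cmd before any OCR call.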

Usage

There are currently two ways to run the program:

  1. Provide the path as a CLI argument:
python3 main.py path/to/file
  2. Run the program without arguments:
python3 main.py

and then enter the path when prompted:

Please enter the path to your file (.pdf, .pptx, or .ppt): path/to/file

Currently, only one file can be processed at a time.

To exit the program, press Ctrl + C or, when prompted, type exit and press Enter.
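
The two ways of supplying a path, together with the exit keyword, can be sketched as below. This is an assumption about the behavior described here, not the code in main.py or core/cli.py; resolve_input_path and is_supported are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence

SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".ppt"}
PROMPT = "Please enter the path to your file (.pdf, .pptx, or .ppt): "


def is_supported(path: str) -> bool:
    """Check the file extension, ignoring case."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def resolve_input_path(
    argv: Sequence[str], ask: Callable[[str], str] = input
) -> Optional[str]:
    """Return the path from argv[1] if given, otherwise prompt for it.

    Returns None when the user types "exit" at the prompt.
    """
    if len(argv) > 1:
        return argv[1]
    answer = ask(PROMPT).strip()
    if answer.lower() == "exit":
        return None
    return answer
```

A caller would pass sys.argv, stop if the result is None, and reject the path with a message when is_supported returns False.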

Contributing

Contributions are welcome! Please submit a pull request or open an issue for discussion.
