Reviewer is a document processing tool that extracts text from PDF and PowerPoint documents (.pdf, .pptx, .ppt) and summarizes it using LangChain and Google's Gemini AI. It supports OCR-based extraction, so image-based PDFs can be processed as well.
- Extract text from PDF and PPT/PPTX files.
- Uses Tesseract OCR to extract text from images within PDFs and PowerPoint slides.
- Summarizes the document into a clear overview.
- Answers user questions based on the document.
- Maintains conversation context and memory.
- Uses LangChain and Google's Gemini API.
- Supports large documents by chunking the document before processing.
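The chunking step mentioned in the last item can be sketched as follows. This is a minimal illustration, not the code in `core/text_chunker.py`: the function name `chunk_text`, the character-based splitting, and the default sizes are all assumptions made for the example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that some context
    around each boundary is carried into the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk can then be sent to the model separately and the partial results combined, which keeps every request within the model's input limit.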
Reviewer/
├── main.py # Entry point
├── requirements.txt # Dependencies
├── .env # Environment file
├── config/
│ └── settings.py # Configuration settings
├── core/
│ ├── __init__.py
│ ├── document_processor.py # Document text extraction
│ ├── text_chunker.py # Text chunking utilities
│ ├── ai_service.py # LLM model integration
│ └── cli.py # User interface functions
└── utils/
  ├── __init__.py
  ├── file_helpers.py # File validation utilities
  └── converters.py # Conversion utilities
- Python 3.8 or higher
- Tesseract OCR
- Poppler
- LibreOffice (Optional)
It is recommended to install unoconv or unoserver alongside LibreOffice for better performance.
Clone the repository:
git clone https://github.com/isaiah76/Reviewer.git
cd Reviewer
On Linux or macOS, run the provided installation script:
chmod +x install.sh && ./install.sh
On Windows, run the provided batch script:
install.bat
Ensure the following dependencies are installed before running the program.
Linux:
- Debian/Ubuntu:
sudo apt install tesseract-ocr
- Arch Linux:
sudo pacman -S tesseract
- Fedora:
sudo dnf install tesseract
macOS:
brew install tesseract
Windows: Download and install from Tesseract OCR. Ensure the installation path is added to your system PATH.
Linux:
- Debian/Ubuntu:
sudo apt install poppler-utils
- Arch Linux:
sudo pacman -S poppler
- Fedora:
sudo dnf install poppler-utils
macOS:
brew install poppler
Windows: Download from Poppler for Windows and add it to the system PATH.
Linux:
- Debian/Ubuntu:
sudo apt install libreoffice unoconv
- Arch Linux:
sudo pacman -S libreoffice-fresh
- Fedora:
sudo dnf install libreoffice unoconv
macOS:
brew install libreoffice
For unoserver:
pip install unoserver
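Once the system packages are installed, a quick way to confirm they are reachable is to look them up on the PATH. The sketch below uses only the standard library. The executable names are assumptions based on common packaging (Poppler usually provides `pdftoppm`, LibreOffice usually provides `soffice`), and the helper names are hypothetical, not part of this project.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Executable names are assumptions; adjust them if your platform differs.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")
OPTIONAL_TOOLS = ("soffice", "unoconv", "unoserver")


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when both Tesseract and Poppler are found. Anything listed by `missing_tools(OPTIONAL_TOOLS)` only affects the optional LibreOffice-based tooling.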
pip install -r requirements.txt
Or, if requirements.txt is missing, install the required packages manually:
pip install python-dotenv langchain langchain-google-genai google-generativeai PyPDF2 python-pptx pyfiglet pytesseract pdf2image Pillow
Before running the program, create a .env file in the project root (or use the provided .env.example as a guide).
- Visit the Google AI Studio website
- Sign in with your Google account
- Navigate to "Get API key" or go to your profile settings
- Create a new API key or use an existing one
- Copy the API key into your .env file
Include your Gemini API key:
GEMINI_API_KEY=your_api_key_here
Optionally, set the Tesseract command if it is not in your PATH:
TESSERACT_CMD=/path/to/tesseract # Linux/macOS
TESSERACT_CMD=C:\\Program Files\\Tesseract-OCR\\tesseract.exe # Windows
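For illustration, the two variables above could be read and validated like this. It is a sketch under assumptions rather than the contents of `config/settings.py`: it assumes the `.env` values have already been loaded into the process environment (for example with `load_dotenv()` from python-dotenv, which is in the dependency list), and `load_settings` is a hypothetical helper name.

```python
import os
import shutil
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read GEMINI_API_KEY and TESSERACT_CMD from an environment mapping."""
    env = os.environ if env is None else env

    api_key = env.get("GEMINI_API_KEY", "").strip()
    # Reject both a missing key and the untouched placeholder value.
    if not api_key or api_key == "your_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")

    # TESSERACT_CMD is optional: fall back to whatever `tesseract`
    # resolves to on PATH (this may be None if it is not installed).
    tesseract_cmd = env.get("TESSERACT_CMD", "").strip() or shutil.which("tesseract")

    return {"gemini_api_key": api_key, "tesseract_cmd": tesseract_cmd}
```

Failing early on a missing key gives a clearer message than an authentication error raised later by the API client.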
There are currently two ways to run the program:
- Provide the path as a CLI argument
python3 main.py path/to/file
- Enter the path when prompted
python3 main.py
Then enter the path at the prompt:
Please enter the path to your file (.pdf, .pptx, or .ppt): path/to/file
Currently, only one file can be processed at a time.
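The two usage modes can be sketched as a single function that takes the path from the command line when one is given and prompts otherwise. This is an illustrative sketch, not the logic in `main.py` or `core/cli.py`: `resolve_input_path` and `is_supported` are hypothetical names, and the extension check is an assumption based on the file types listed above.

```python
import os
from typing import Callable, Sequence

SUPPORTED_EXTENSIONS = (".pdf", ".pptx", ".ppt")
PROMPT = "Please enter the path to your file (.pdf, .pptx, or .ppt): "


def is_supported(path: str) -> bool:
    """Check the file extension, ignoring case."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS


def resolve_input_path(argv: Sequence[str], ask: Callable[[str], str] = input) -> str:
    """Return the path from argv[1] if present, otherwise ask for it."""
    if len(argv) > 2:
        raise SystemExit("Only one file can be processed at a time.")
    path = argv[1] if len(argv) == 2 else ask(PROMPT)
    path = path.strip().strip('"')  # tolerate stray spaces and pasted quotes
    if not is_supported(path):
        raise ValueError(f"Unsupported file type: {path!r}")
    return path
```

In a script this would be called as `resolve_input_path(sys.argv)`. Passing `ask` as a parameter keeps the function testable without real keyboard input.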
To exit the program, press Ctrl + C. Or, if you are prompted, type exit and press Enter.
Contributions are welcome! Please submit a pull request or open an issue for discussion.