Reviewer

Reviewer is a document processing tool that extracts text from PDF and PowerPoint documents (.pdf, .pptx, .ppt) and summarizes it using LangChain and Google's Gemini AI. It also supports OCR-based extraction for image-based PDFs.

Features

Extracts text from PDF and PPT/PPTX files.
Uses Tesseract OCR to extract text from images within PDFs and PowerPoint slides.
Summarizes the document into a clear overview.
Answers user questions based on the document.
Maintains conversation context and memory.
Uses LangChain and Google's Gemini API.
Supports large documents by splitting them into chunks before processing.
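
The chunking feature can be pictured with a minimal sketch. This is not the code in core/text_chunker.py; the function name `chunk_text`, the character-based sizes, and the overlap parameter are assumptions chosen for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size characters.

    Hypothetical helper: the real project may chunk by tokens or sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    chunks: List[str] = []
    step = chunk_size - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

The overlap keeps a little shared context between neighbouring chunks, which can help when a sentence straddles a chunk boundary.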

Project Structure

Reviewer/
├── main.py                    # Entry point
├── requirements.txt           # Dependencies
├── .env                       # Environment file
├── config/
│   └── settings.py            # Configuration settings
├── core/
│   ├── __init__.py
│   ├── document_processor.py  # Document text extraction
│   ├── text_chunker.py        # Text chunking utilities
│   ├── ai_service.py          # LLM model integration
│   └── cli.py                 # User interface functions
└── utils/
    ├── __init__.py
    ├── file_helpers.py        # File validation utilities
    └── converters.py          # Conversion utilities

Requirements

Python 3.8 or higher
Tesseract OCR
Poppler
LibreOffice (Optional)

Installing unoconv or unoserver alongside LibreOffice is recommended for better performance.

Installation

Clone the repository:

git clone https://github.com/isaiah76/Reviewer.git
cd Reviewer

Linux Installation

Run the provided installation script:

chmod +x install.sh && ./install.sh

Windows Installation

Run the provided batch script:

install.bat

Dependencies

Ensure the following dependencies are installed before running the program:

Tesseract OCR

Linux:

Debian/Ubuntu:
```
sudo apt install tesseract-ocr
```
Arch Linux:
```
sudo pacman -S tesseract
```
Fedora:
```
sudo dnf install tesseract
```

macOS:

brew install tesseract

Windows: Download and install Tesseract OCR. Ensure the installation path is added to your system PATH.

Poppler

Linux:

Debian/Ubuntu:
```
sudo apt install poppler-utils
```
Arch Linux:
```
sudo pacman -S poppler
```
Fedora:
```
sudo dnf install poppler-utils
```

macOS:

brew install poppler

Windows: Download Poppler for Windows and add it to the system PATH.
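
To confirm that Tesseract and Poppler are reachable on your PATH, a small check along these lines may help. It is a sketch, not part of the project: the executable names are assumptions (`tesseract` for the Tesseract CLI, `pdftoppm` as one of the tools shipped with Poppler), and the helper names are hypothetical.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names: "tesseract" (Tesseract CLI) and
# "pdftoppm" (one of the command-line tools that ships with Poppler).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```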

LibreOffice or unoconv/unoserver (Optional tools for .ppt to .pptx conversion):

Linux:

Debian/Ubuntu:
```
sudo apt install libreoffice unoconv
```
Arch Linux:
```
sudo pacman -S libreoffice-fresh
```
Fedora:
```
sudo dnf install libreoffice unoconv
```

macOS:

brew install libreoffice

For unoserver:

pip install unoserver
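
For reference, a .ppt to .pptx conversion with LibreOffice's headless mode might look like the sketch below. It is not taken from utils/converters.py: the function names are hypothetical, and it assumes the LibreOffice launcher is available as `soffice` (on some systems it is installed as `libreoffice` instead).

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(ppt_path: str, out_dir: str, soffice: str = "soffice") -> List[str]:
    """Build the LibreOffice command line for a headless .ppt -> .pptx conversion."""
    return [soffice, "--headless", "--convert-to", "pptx", "--outdir", out_dir, ppt_path]


def convert_ppt_to_pptx(ppt_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path where the .pptx is expected."""
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(ppt_path, out_dir), check=True)
    return Path(out_dir) / (Path(ppt_path).stem + ".pptx")
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.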

If the installation script didn't process the packages correctly, you can install them directly:

pip install -r requirements.txt

Or, if requirements.txt is missing, install the required packages manually:

pip install python-dotenv langchain langchain-google-genai google-generativeai PyPDF2 python-pptx pyfiglet pytesseract pdf2image Pillow
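
After installing, you can check which packages Python can actually import. The sketch below is illustrative only and the helper name `missing_modules` is hypothetical. Note that several import names differ from the pip package names (python-dotenv is imported as `dotenv`, python-pptx as `pptx`, Pillow as `PIL`).

```python
import importlib.util
from typing import Iterable, List

# Import names for the pip packages listed above.
MODULES = (
    "dotenv", "langchain", "langchain_google_genai", "google.generativeai",
    "PyPDF2", "pptx", "pyfiglet", "pytesseract", "pdf2image", "PIL",
)


def missing_modules(names: Iterable[str] = MODULES) -> List[str]:
    """Return the module names that cannot be located in this environment."""
    missing: List[str] = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself missing.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
```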

Environment Variables

Before running the program, create a .env file in the project root (or use the provided .env.example as a guide).

Google Gemini API Key

Visit the Google AI Studio website
Sign in with your Google account
Navigate to "Get API key" or go to your profile settings
Create a new API key or use an existing one
Copy the API key into your .env file

Include your Gemini API key:

GEMINI_API_KEY=your_api_key_here

Optionally, set the Tesseract command if it is not in your PATH:

TESSERACT_CMD=/path/to/tesseract  # Linux/macOS
TESSERACT_CMD=C:\\Program Files\\Tesseract-OCR\\tesseract.exe  # Windows
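
As a rough illustration of how these two variables might be consumed, the sketch below reads them from a mapping such as `os.environ`. It is not the code in config/settings.py: the `Settings` type and `load_settings` function are hypothetical, and it assumes the .env file has already been loaded into the environment (for example with python-dotenv's `load_dotenv()`).

```python
import os
import shutil
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    gemini_api_key: str
    tesseract_cmd: Optional[str]  # None if Tesseract could not be located


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read GEMINI_API_KEY (required) and TESSERACT_CMD (optional) from env."""
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    # Prefer an explicit TESSERACT_CMD; otherwise look for tesseract on PATH.
    tesseract_cmd = env.get("TESSERACT_CMD", "").strip() or shutil.which("tesseract")
    return Settings(api_key, tesseract_cmd)
```

Failing early on a missing API key gives a clearer error than a later authentication failure from the Gemini API.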

Usage

There are currently two ways to run the program:

Provide the path as a CLI argument

python3 main.py path/to/file

Enter the path when prompted

python3 main.py

Then enter the path at the prompt:

Please enter the path to your file (.pdf, .pptx, or .ppt): path/to/file

Currently, only one file can be processed at a time.

To exit the program, press Ctrl + C or, when prompted, type exit and press Enter.
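
The two usage modes above can be sketched as follows. This is not the project's main.py or cli.py: `resolve_path` and `is_supported` are hypothetical names, and treating `exit` as a quit command only at the prompt is an assumption based on the description above.

```python
import sys
from pathlib import Path
from typing import Callable, List, Optional

SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".ppt"}
PROMPT = "Please enter the path to your file (.pdf, .pptx, or .ppt): "


def resolve_path(argv: List[str], ask: Callable[[str], str] = input) -> Optional[Path]:
    """Take the path from the first CLI argument, else prompt for it.

    Returns None if the user types "exit" at the prompt.
    """
    if len(argv) > 1:
        return Path(argv[1])
    answer = ask(PROMPT).strip()
    if answer.lower() == "exit":
        return None
    return Path(answer)


def is_supported(path: Path) -> bool:
    """Check the file extension, ignoring case."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    try:
        chosen = resolve_path(sys.argv)
    except (KeyboardInterrupt, EOFError):
        chosen = None  # Ctrl + C or closed input at the prompt
    if chosen is not None and not is_supported(chosen):
        print("Unsupported file type: " + chosen.suffix)
```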

Contributing

Contributions are welcome! Please submit a pull request or open an issue for discussion.
