A full-stack application for uploading, processing, and analyzing research papers with AI-powered extraction and summarization capabilities.
Scipher is a modern web application that simplifies research paper processing through:
- PDF Upload & Processing: Drag-and-drop interface with background processing
- Content Extraction: Converts PDFs to structured markdown using Docling
- NER Extraction: Automatically extracts glossary terms using spaCy
- Real-time Status: Live updates during document processing
- Responsive Design: Works on desktop and mobile devices
graph TB
subgraph "Frontend (Next.js)"
UI["Web Interface<br/>localhost:3000"]
Upload["DocumentUploader"]
Viewer["DocumentViewer"]
Status["ProcessingStatus"]
end
subgraph "Backend (FastAPI)"
API["REST API<br/>localhost:8080"]
Routes["API Routes"]
Processor["DocumentProcessor"]
NER["NERExtractor"]
end
subgraph "Background Processing"
BGTasks["BackgroundTasks"]
ThreadPool["ThreadPoolExecutor"]
Docling["Docling Converter"]
SpaCy["spaCy NER"]
end
subgraph "Storage"
DB["SQLite Database"]
FS["File System<br/>uploads/, processed/"]
end
UI --> API
Upload --> Routes
Viewer --> Routes
Status --> Routes
Routes --> Processor
Processor --> BGTasks
BGTasks --> ThreadPool
ThreadPool --> Docling
ThreadPool --> SpaCy
Processor --> DB
Processor --> FS
Backend [1]:
- FastAPI (Web framework)
- SQLAlchemy + aiosqlite (Database ORM)
- Docling (PDF processing)
- spaCy (Named Entity Recognition)
- uvicorn (ASGI server)
Frontend [2]:
- Next.js 15.5.4 (React framework)
- TypeScript
- Tailwind CSS
- Axios (HTTP client)
Prerequisites:
- Python ≥3.13
- Node.js ≥18.18.0
- uv package manager
- Clone the repository:
git clone https://github.com/Arynshr/Scipher.git
cd Scipher
- Backend setup:
cd backend
uv sync
source .venv/bin/activate
python -m spacy download en_core_web_sm
- Frontend setup:
cd ../Web
npm install
- Start the application:
cd .. # Return to project root
chmod +x start-dev.sh
./start-dev.sh
The startup script [3] handles:
- Cache cleanup
- Backend startup on port 8080
- Frontend startup on port 3000
- Graceful shutdown on Ctrl+C
Once running, the services are available at:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8080
- API Documentation: http://localhost:8080/docs
Configuration is managed through the `Settings` class [4]:

| Setting | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite+aiosqlite:///./scipher.db` | Database connection |
| `UPLOAD_DIR` | `uploads/` | PDF upload directory |
| `MAX_FILE_SIZE` | 50MB | Maximum upload size |
| `ALLOWED_EXTENSIONS` | `.pdf` | Permitted file types |
| `PROCESSING_TIMEOUT` | 180s | Processing timeout |
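
As a rough illustration of how the upload limits in the settings table might be enforced, the sketch below checks a file name and size against the documented defaults. `validate_upload` is a hypothetical helper written for this README, not a function from the Scipher codebase; only the two default values are taken from the table.

```python
from pathlib import Path

# Defaults copied from the settings table; the helper itself is hypothetical.
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB expressed in bytes
ALLOWED_EXTENSIONS = {".pdf"}


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()  # case-insensitive extension check
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"extension {suffix or '(none)'} is not allowed")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_SIZE}")
    return problems


if __name__ == "__main__":
    print(validate_upload("paper.pdf", 1024))
    print(validate_upload("notes.docx", 60 * 1024 * 1024))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with an upload in a single response.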
Create Web/.env.local:
NEXT_PUBLIC_API_URL=http://localhost:8080

| Endpoint | Method | Description |
|---|---|---|
| `/api/upload` | POST | Upload PDF document |
| `/api/status/{doc_id}` | GET | Check processing status |
| `/api/document/{doc_id}` | GET | Get processed document |
| `/api/document/{doc_id}/sections` | GET | Get document sections |
| `/api/document/{doc_id}/text` | GET | Get raw text |

| Endpoint | Method | Description |
|---|---|---|
| `/api/document/{doc_id}/glossary` | GET | Get glossary terms |
| `/api/document/{doc_id}/glossary` | POST | Add manual term |
| `/api/document/{doc_id}/glossary/{term_id}` | DELETE | Delete term |
| `/api/document/{doc_id}/glossary/extract` | POST | Re-extract terms |
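
To make the endpoint tables more concrete, here is a minimal sketch of how a Python client might build requests for the glossary routes using only the standard library. The helper names (`build_request`, `glossary_path`), the placeholder document ID, and the JSON field names in the POST payload are assumptions for illustration; the actual request schema should be checked against the API documentation at `/docs`.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "http://localhost:8080"  # default backend address from this README


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the listed endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(API_BASE + path, data=data, headers=headers, method=method)


def glossary_path(doc_id: str, term_id: Optional[str] = None) -> str:
    """Path of the glossary collection, or of a single term when term_id is given."""
    base = f"/api/document/{doc_id}/glossary"
    return f"{base}/{term_id}" if term_id is not None else base


if __name__ == "__main__":
    doc_id = "example-doc-id"  # placeholder; real IDs come from POST /api/upload
    list_terms = build_request("GET", glossary_path(doc_id))
    # The payload field names below are assumptions, not the documented schema.
    add_term = build_request("POST", glossary_path(doc_id), {"term": "transformer"})
    delete_term = build_request("DELETE", glossary_path(doc_id, "42"))
    for req in (list_terms, add_term, delete_term):
        print(req.get_method(), req.full_url)
    # To send one, pass it to urllib.request.urlopen() while the backend is running.
```

Building the request separately from sending it keeps the example runnable without a live backend.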
The system uses FastAPI's `BackgroundTasks` for asynchronous processing [5]:
- Upload: File saved to the `uploads/` directory
- Queue: Background task queued with document ID
- Processing: PDF → Markdown conversion via Docling
- Extraction: NER processing with spaCy
- Storage: Results saved to the database and the `processed/` directory
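
The steps above can be sketched with the standard library alone. This is a simplified illustration of the pattern (blocking work pushed onto a thread pool while status is tracked), not the project's `DocumentProcessor`: the two worker functions are toy stand-ins for Docling and spaCy, and the in-memory `statuses` dict is an assumed substitute for the database.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def convert_pdf_to_markdown(pdf_path: str) -> str:
    # Stand-in for the Docling conversion step.
    return f"# Converted from {pdf_path}\n\nGraph Neural Networks outperform Baseline Models."


def extract_terms(markdown: str) -> list[str]:
    # Toy heuristic (capitalized words) standing in for spaCy NER.
    words = markdown.replace("#", " ").split()
    return sorted({w.strip(".,") for w in words if w[:1].isupper()})


async def process_document(doc_id: str, pdf_path: str, statuses: dict, pool: ThreadPoolExecutor) -> list[str]:
    """Run the blocking steps in worker threads so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    statuses[doc_id] = "PROCESSING"
    markdown = await loop.run_in_executor(pool, convert_pdf_to_markdown, pdf_path)
    terms = await loop.run_in_executor(pool, extract_terms, markdown)
    statuses[doc_id] = "COMPLETED"
    return terms


async def main() -> None:
    statuses = {"doc-1": "UPLOADED"}  # in-memory stand-in for the database
    with ThreadPoolExecutor(max_workers=2) as pool:
        terms = await process_document("doc-1", "uploads/doc-1.pdf", statuses, pool)
    print(statuses["doc-1"], terms)


if __name__ == "__main__":
    asyncio.run(main())
```

Offloading to an executor matters because PDF conversion and NER are CPU-bound, synchronous calls; running them directly inside a coroutine would block every other request served by the same event loop.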
sequenceDiagram
participant Client
participant API as Upload API
participant BG as BackgroundTasks
participant Proc as DocumentProcessor
participant DB as Database
Client->>API: POST /api/upload (PDF)
API->>DB: CREATE Document (UPLOADED)
API->>BG: add_task(process_document)
API-->>Client: 200 OK {doc_id, status}
Note over BG,Proc: Async processing
BG->>Proc: process_document(doc_id)
Proc->>DB: UPDATE status=PROCESSING
Proc->>Proc: Convert PDF (Docling)
Proc->>Proc: Extract entities (spaCy)
Proc->>DB: INSERT glossary terms
Proc->>DB: UPDATE status=COMPLETED
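
From the client's side, the sequence above implies polling the status endpoint until processing finishes. The following is a minimal sketch of such a loop; `wait_for_completion` is a hypothetical helper, the `FAILED` status is an assumed error state that does not appear in the diagram, and the status source is injected as a callable so the example runs without a backend.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # "FAILED" is an assumed error state


def wait_for_completion(
    fetch_status: Callable[[str], str],
    doc_id: str,
    timeout: float = 180.0,  # mirrors the PROCESSING_TIMEOUT default
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(doc_id) until a terminal status or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(doc_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {doc_id} still {status} after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Fake status source that mimics the sequence in the diagram.
    replies = iter(["UPLOADED", "PROCESSING", "PROCESSING", "COMPLETED"])
    print(wait_for_completion(lambda _id: next(replies), "doc-1", interval=0.0))
```

In real use, `fetch_status` would issue `GET /api/status/{doc_id}` and read the status field from the response body; the exact field name is not specified here and should be confirmed against the API documentation.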
Scipher/
├── backend/ # FastAPI application
│ ├── src/scipher/ # Source code
│ │ ├── api/ # API routes
│ │ ├── core/ # Processing services
│ │ ├── models/ # Data models
│ │ └── config.py # Configuration
│ ├── pyproject.toml # Dependencies
│ └── main.py # Application entry
├── Web/ # Next.js frontend
│ ├── app/ # App router pages
│ ├── components/ # React components
│ ├── utils/ # Utilities
│ └── package.json # Dependencies
└── start-dev.sh # Development script
# Backend tests
cd backend
source .venv/bin/activate
pytest
# Frontend tests
cd Web
npm test
The project uses:
- TypeScript for frontend type safety
- Pydantic for backend data validation
- SQLAlchemy for database ORM
- Async/await patterns throughout
- Backend:
cd backend
uv sync --no-dev
PYTHONPATH=src uvicorn main:app --host 0.0.0.0 --port 8080
- Frontend:
cd Web
npm run build
npm start
Production requires:
- `DATABASE_URL`: Production database connection
- `CORS_ORIGINS`: Allowed frontend origins
- `NEXT_PUBLIC_API_URL`: Backend API URL
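
A small pre-flight check can catch a missing production variable before the services start. The sketch below is one possible approach, not part of the repository; `missing_vars` is a hypothetical helper, and it only checks that each variable is set and non-blank, not that its value is valid.

```python
import os
from typing import Mapping

# The three variables listed as production requirements in this README.
REQUIRED_VARS = ("DATABASE_URL", "CORS_ORIGINS", "NEXT_PUBLIC_API_URL")


def missing_vars(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing production settings:", ", ".join(missing))
    else:
        print("All production settings are present.")
```

Note that pydantic-settings generally expects list-valued settings such as `CORS_ORIGINS` to be supplied as a JSON array string in the environment, so a presence check alone does not guarantee the value will parse.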
Backend fails to start:
cd backend
source .venv/bin/activate
uv sync --reinstall
spaCy model not found:
python -m spacy download en_core_web_sm
Port conflicts:
lsof -i :8080 # Backend
lsof -i :3000 # Frontend
The startup script automatically clears caches [6]:
- Python `__pycache__` directories
- Next.js `.next` build artifacts
- npm cache
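
For the port conflicts mentioned in the troubleshooting steps, a cross-platform alternative to `lsof` is a short connection probe. This is a minimal sketch with a hypothetical `port_in_use` helper; it only detects listeners reachable on the given host (localhost by default) and does not identify which process owns the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in (("backend", 8080), ("frontend", 3000)):
        state = "in use" if port_in_use(port) else "free"
        print(f"{name} port {port}: {state}")
```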
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
This project is licensed under the MIT License.
This README is generated based on the current codebase structure and configuration. For the most up-to-date information, refer to the individual component documentation and inline code comments.
File: backend/pyproject.toml (L10-32)
dependencies = [
"aiofiles>=24.1.0",
"aiohttp>=3.13.0",
"aiosqlite>=0.21.0",
"alembic>=1.16.5",
"docling>=2.58.0",
"fastapi>=0.118.0",
"fitz>=0.0.1.dev2",
"hatchling>=1.27.0",
"marker-pdf>=1.10.1",
"onnxruntime>=1.23.2",
"opencv-python-headless>=4.11.0.86",
"pikepdf>=10.0.0",
"pydantic>=2.11.9",
"pydantic-settings>=2.11.0",
"pymupdf>=1.26.5",
"python-multipart>=0.0.20",
"rapidocr>=3.4.2",
"setuptools>=80.9.0",
"spacy>=3.7.0",
"sqlalchemy>=2.0.43",
"uvicorn>=0.37.0",
]
File: start-dev.sh (L1-90)
#!/bin/bash
# ==========================================
# 🚀 Scipher Development Startup Script
# Starts both FastAPI (backend) and Next.js (Web)
# ==========================================
set -e # Exit immediately on error
echo "🚀 Starting Scipher Development Environment"
echo "=========================================="
# Ensure we're in the project root
if [ ! -d "backend" ] || [ ! -d "Web" ]; then
echo "❌ Error: Please run this script from the scipher project root directory."
exit 1
fi
# --- Cache Cleanup Function ---
clean_caches() {
echo "🧹 Cleaning caches for a fresh start..."
# Backend: Python caches
echo " 📦 Clearing Python caches..."
find backend -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true
find backend -name "*.pyc" -delete 2>/dev/null || true
find backend -name "*.pyo" -delete 2>/dev/null || true
# Backend: App-specific caches (e.g., processed data dir if exists)
if [ -d "backend/src/processed_data" ]; then # Adjust path based on settings.PROCESSED_DATA_DIR
echo " 📁 Clearing processed data cache..."
rm -rf backend/processed/*.json 2>/dev/null || true
fi
# Frontend: Next.js and npm caches
echo " 🖥️ Clearing Next.js and npm caches..."
cd Web
rm -rf .next 2>/dev/null || true
rm -rf node_modules/.cache 2>/dev/null || true
npm cache clean --force 2>/dev/null || true
cd ..
# General: Any temp dirs (optional: add more as needed)
rm -rf /tmp/scipher_* 2>/dev/null || true # Temp files if used
echo "✅ Caches cleared!"
}
# --- Backend startup ---
start_backend() {
echo "📡 Starting FastAPI Backend..."
cd backend
# Activate venv if it exists (uv-compatible)
if [ -d ".venv" ]; then
echo "📦 Activating virtual environment..."
source .venv/bin/activate
fi
# Install dependencies (prefer uv if available)
if command -v uv &> /dev/null; then
echo "📥 Installing Python dependencies via uv..."
uv sync
elif [ -f "requirements.txt" ]; then
echo "📥 Installing Python dependencies via pip..."
pip install -r requirements.txt
fi
echo "🚀 Starting backend server at http://localhost:8080"
PYTHONPATH=src uvicorn main:app --reload --host 0.0.0.0 --port 8080 &
BACKEND_PID=$!
cd ..
echo "✅ Backend started (PID: $BACKEND_PID)"
}
# --- Web (Next.js) startup ---
start_web() {
echo "🖥️ Starting Next.js Web Frontend..."
cd Web
if [ ! -d "node_modules" ]; then
echo "📦 Installing Node.js dependencies..."
npm install
fi
echo "🚀 Starting frontend server at http://localhost:3000"
npm run dev &
FRONTEND_PID=$!
cd ..
echo "✅ Web frontend started (PID: $FRONTEND_PID)"
File: backend/src/scipher/config.py (L6-31)
class Settings(BaseSettings):
APP_NAME: str = "Scipher API"
APP_VERSION: str = "1.0.0"
DEBUG: bool = True
HOST: str = "0.0.0.0"
PORT: int = 8080
DATABASE_URL: str = "sqlite+aiosqlite:///./scipher.db"
DB_ECHO: bool = False
UPLOAD_DIR: Path = Path("uploads")
MAX_FILE_SIZE: int = 50 * 1024 * 1024 # 50MB
ALLOWED_EXTENSIONS: Set[str] = {".pdf"}
CORS_ORIGINS: list[str] = ["http://localhost:3000", "http://localhost:3001"]
PROCESSING_TIMEOUT: int = 180
PROCESSED_DATA_DIR: Path = Path("processed")
TEMP_DIR: Path = Path("temp")
model_config = SettingsConfigDict(
env_file=".env",
case_sensitive=True
)
File: backend/src/scipher/core/document_processor.py (L303-319)
async def summarize_document(self, doc_id: str, summarizer: Optional[DocumentSummarizer] = None) -> SummaryResult:
"""Generate summaries for a processed document."""
markdown_text = self.load_markdown_file(doc_id)
if not markdown_text:
raise ProcessingException("No processed markdown available for summarization")
summarizer = summarizer or self.summarizer
loop = asyncio.get_event_loop()
try:
result = await loop.run_in_executor(None, summarizer.summarize, markdown_text)
except ValueError as exc:
raise ProcessingException(str(exc)) from exc
return result
# Singleton instance
document_processor = DocumentProcessor()