Scipher: Research Paper Processing Platform

A full-stack application for uploading, processing, and analyzing research papers with AI-powered extraction and summarization capabilities.

Overview

Scipher is a modern web application that simplifies research paper processing through:

PDF Upload & Processing: Drag-and-drop interface with background processing
Content Extraction: Converts PDFs to structured markdown using Docling
NER Extraction: Automatically extracts glossary terms using spaCy
Real-time Status: Live updates during document processing
Responsive Design: Works on desktop and mobile devices

Architecture

System Components

graph TB
    subgraph "Frontend (Next.js)"
        UI["Web Interface<br/>localhost:3000"]
        Upload["DocumentUploader"]
        Viewer["DocumentViewer"]
        Status["ProcessingStatus"]
    end
    
    subgraph "Backend (FastAPI)"
        API["REST API<br/>localhost:8080"]
        Routes["API Routes"]
        Processor["DocumentProcessor"]
        NER["NERExtractor"]
    end
    
    subgraph "Background Processing"
        BGTasks["BackgroundTasks"]
        ThreadPool["ThreadPoolExecutor"]
        Docling["Docling Converter"]
        SpaCy["spaCy NER"]
    end
    
    subgraph "Storage"
        DB["SQLite Database"]
        FS["File System<br/>uploads/, processed/"]
    end
    
    UI --> API
    Upload --> Routes
    Viewer --> Routes
    Status --> Routes
    
    Routes --> Processor
    Processor --> BGTasks
    BGTasks --> ThreadPool
    ThreadPool --> Docling
    ThreadPool --> SpaCy
    
    Processor --> DB
    Processor --> FS

Technology Stack

Backend [1]:

FastAPI (Web framework)
SQLAlchemy + aiosqlite (Database ORM)
Docling (PDF processing)
spaCy (Named Entity Recognition)
uvicorn (ASGI server)

Frontend [2]:

Next.js 15.5.4 (React framework)
TypeScript
Tailwind CSS
Axios (HTTP client)

Quick Start

Prerequisites

Python ≥3.13
Node.js ≥18.18.0
uv package manager

Installation

Clone the repository:

git clone https://github.com/Arynshr/Scipher.git
cd Scipher

Backend setup:

cd backend
uv sync
source .venv/bin/activate
python -m spacy download en_core_web_sm

Frontend setup:

cd ../Web
npm install

Start the application:

cd ..  # Return to project root
chmod +x start-dev.sh
./start-dev.sh

The startup script [3] handles:

Cache cleanup
Backend startup on port 8080
Frontend startup on port 3000
Graceful shutdown on Ctrl+C

Access Points

Frontend: http://localhost:3000
Backend API: http://localhost:8080
API Documentation: http://localhost:8080/docs

Configuration

Backend Settings

Configuration is managed through the Settings class [4]:

| Setting | Default | Description |
| --- | --- | --- |
| `DATABASE_URL` | `sqlite+aiosqlite:///./scipher.db` | Database connection |
| `UPLOAD_DIR` | `uploads/` | PDF upload directory |
| `MAX_FILE_SIZE` | 50 MB | Maximum upload size |
| `ALLOWED_EXTENSIONS` | `.pdf` | Permitted file types |
| `PROCESSING_TIMEOUT` | 180 s | Processing timeout |
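
As a rough illustration of how the size and extension limits above could be enforced, here is a minimal sketch using only the standard library. The constants mirror the documented defaults; `validate_upload` is a hypothetical helper, not a function from the Scipher codebase.

```python
from pathlib import Path

# Values mirror the documented defaults; the helper below is hypothetical.
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB, in bytes
ALLOWED_EXTENSIONS = {".pdf"}


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload passes."""
    problems = []
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"extension {suffix or '(none)'} is not allowed")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_SIZE}")
    return problems


print(validate_upload("paper.pdf", 1024))
print(validate_upload("notes.docx", 60 * 1024 * 1024))
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single response.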

Frontend Environment

Create Web/.env.local:

NEXT_PUBLIC_API_URL=http://localhost:8080

API Endpoints

Document Management

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/upload` | POST | Upload PDF document |
| `/api/status/{doc_id}` | GET | Check processing status |
| `/api/document/{doc_id}` | GET | Get processed document |
| `/api/document/{doc_id}/sections` | GET | Get document sections |
| `/api/document/{doc_id}/text` | GET | Get raw text |
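
Because processing runs in the background, a client typically polls `/api/status/{doc_id}` after uploading. The sketch below shows one way to do that with the standard library. It assumes the status response is a JSON object with a `status` key; `COMPLETED` appears in this README's processing flow, while `FAILED` is an assumed terminal state. Both function names are hypothetical.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen

# COMPLETED is documented in this README; FAILED is an assumed terminal state.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def http_fetch_status(doc_id: str, base_url: str = "http://localhost:8080") -> dict:
    """Fetch the status document for doc_id (response shape is an assumption)."""
    with urlopen(f"{base_url}/api/status/{doc_id}") as resp:
        return json.load(resp)


def wait_for_document(
    fetch_status: Callable[[str], dict],
    doc_id: str,
    timeout: float = 180.0,  # mirrors the PROCESSING_TIMEOUT default
    interval: float = 2.0,
) -> str:
    """Poll until the document reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(doc_id).get("status", "")
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {doc_id} still {status!r} after {timeout}s")
        time.sleep(interval)
```

With the backend running, `wait_for_document(http_fetch_status, doc_id)` would block until processing finishes. Passing the fetcher as a parameter keeps the polling loop testable without a live server.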

Glossary Management

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/document/{doc_id}/glossary` | GET | Get glossary terms |
| `/api/document/{doc_id}/glossary` | POST | Add manual term |
| `/api/document/{doc_id}/glossary/{term_id}` | DELETE | Delete term |
| `/api/document/{doc_id}/glossary/extract` | POST | Re-extract terms |
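
The glossary routes share one URL prefix, so a small request builder can cover them. This is a sketch only: the request body fields `term` and `definition` are assumptions (check http://localhost:8080/docs for the actual schema), and `glossary_request` is a hypothetical helper.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8080"  # default backend address from this README


def glossary_request(
    doc_id: str,
    method: str = "GET",
    term_id: Optional[str] = None,
    payload: Optional[dict] = None,
) -> Request:
    """Build (but do not send) a request for one of the glossary endpoints."""
    url = f"{BASE_URL}/api/document/{doc_id}/glossary"
    if term_id is not None:
        url += f"/{term_id}"
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)


# Hypothetical body: the field names "term" and "definition" are assumptions.
add = glossary_request(
    "doc-123", "POST",
    payload={"term": "attention", "definition": "weighting over input tokens"},
)
delete = glossary_request("doc-123", "DELETE", term_id="42")
print(add.get_method(), add.full_url)
print(delete.get_method(), delete.full_url)
```

Passing either object to `urllib.request.urlopen` would send it to a running backend.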

Document Processing Pipeline

Background Processing Architecture

The system uses FastAPI's BackgroundTasks for asynchronous processing [5]:

Upload: File saved to uploads/ directory
Queue: Background task queued with document ID
Processing: PDF → Markdown conversion via Docling
Extraction: NER processing with spaCy
Storage: Results saved to database and processed/ directory
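
The steps above can be sketched as an async function that hands the blocking work to a thread pool through `run_in_executor`, the same pattern that `summarize_document` uses in the Citations section. The converter and extractor here are stand-ins: Docling and spaCy are not called, and the in-memory `statuses` dict replaces the database purely for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Stand-in for the Docling conversion step (blocking)."""
    return f"# Converted from {pdf_path}\n\nTransformer models use attention."


def extract_terms(markdown: str) -> list[str]:
    """Stand-in for spaCy NER: capitalised words play the role of entities."""
    words = (w.strip(".,#") for w in markdown.split())
    return sorted({w for w in words if w[:1].isupper()})


async def process_document(doc_id: str, pdf_path: str,
                           executor: ThreadPoolExecutor, statuses: dict) -> dict:
    """Run the blocking steps off the event loop and track status transitions."""
    loop = asyncio.get_running_loop()
    statuses[doc_id] = "PROCESSING"
    markdown = await loop.run_in_executor(executor, convert_pdf_to_markdown, pdf_path)
    terms = await loop.run_in_executor(executor, extract_terms, markdown)
    statuses[doc_id] = "COMPLETED"
    return {"doc_id": doc_id, "markdown": markdown, "glossary": terms}


if __name__ == "__main__":
    tracked = {"demo": "UPLOADED"}
    with ThreadPoolExecutor(max_workers=2) as pool:
        outcome = asyncio.run(process_document("demo", "uploads/demo.pdf", pool, tracked))
    print(tracked["demo"], outcome["glossary"])
```

Running the CPU-bound conversion in an executor keeps the event loop free to answer status requests while a document is being processed.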

Processing Flow

sequenceDiagram
    participant Client
    participant API as Upload API
    participant BG as BackgroundTasks
    participant Proc as DocumentProcessor
    participant DB as Database
    
    Client->>API: POST /api/upload (PDF)
    API->>DB: CREATE Document (UPLOADED)
    API->>BG: add_task(process_document)
    API-->>Client: 200 OK {doc_id, status}
    
    Note over BG,Proc: Async processing
    BG->>Proc: process_document(doc_id)
    Proc->>DB: UPDATE status=PROCESSING
    Proc->>Proc: Convert PDF (Docling)
    Proc->>Proc: Extract entities (spaCy)
    Proc->>DB: INSERT glossary terms
    Proc->>DB: UPDATE status=COMPLETED

Development

Project Structure

Scipher/
├── backend/                 # FastAPI application
│   ├── src/scipher/        # Source code
│   │   ├── api/           # API routes
│   │   ├── core/          # Processing services
│   │   ├── models/        # Data models
│   │   └── config.py      # Configuration
│   ├── pyproject.toml     # Dependencies
│   └── main.py           # Application entry
├── Web/                   # Next.js frontend
│   ├── app/              # App router pages
│   ├── components/       # React components
│   ├── utils/           # Utilities
│   └── package.json     # Dependencies
└── start-dev.sh         # Development script

Running Tests

# Backend tests
cd backend
source .venv/bin/activate
pytest

# Frontend tests (continuing from backend/)
cd ../Web
npm test

Code Quality

The project uses:

TypeScript for frontend type safety
Pydantic for backend data validation
SQLAlchemy for database ORM
Async/await patterns throughout

Deployment

Production Build

Backend:

cd backend
uv sync --no-dev
PYTHONPATH=src uvicorn main:app --host 0.0.0.0 --port 8080

Frontend:

cd Web
npm run build
npm start

Environment Variables

Production requires:

DATABASE_URL: Production database connection
CORS_ORIGINS: Allowed frontend origins
NEXT_PUBLIC_API_URL: Backend API URL
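
A pre-flight check for the backend variables might look like the sketch below. It assumes `CORS_ORIGINS` is supplied as a JSON array, which is how pydantic-settings reads list-typed fields from the environment; `missing_vars` and `parse_origins` are hypothetical helpers, and `https://app.example.com` is a placeholder origin.

```python
import json
import os
from typing import Mapping

REQUIRED_BACKEND_VARS = ("DATABASE_URL", "CORS_ORIGINS")


def missing_vars(env: Mapping[str, str], required=REQUIRED_BACKEND_VARS) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def parse_origins(raw: str) -> list[str]:
    """Parse CORS_ORIGINS given as a JSON array of origin strings."""
    origins = json.loads(raw)
    if not isinstance(origins, list) or not all(isinstance(o, str) for o in origins):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return origins


if __name__ == "__main__":
    absent = missing_vars(os.environ)
    if absent:
        print("Missing:", ", ".join(absent))
    else:
        print("Origins:", parse_origins(os.environ["CORS_ORIGINS"]))
```

For example, `CORS_ORIGINS='["https://app.example.com"]'` would parse to a one-element list.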

Troubleshooting

Common Issues

Backend fails to start:

cd backend
source .venv/bin/activate
uv sync --reinstall

spaCy model not found:

python -m spacy download en_core_web_sm

Port conflicts:

lsof -i :8080  # Backend
lsof -i :3000  # Frontend
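
Where `lsof` is not available, a similar check can be sketched in Python by attempting a TCP connection to each port. `port_in_use` is a hypothetical helper; it only reports whether something accepts connections on the port, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


for name, port in (("Backend", 8080), ("Frontend", 3000)):
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} port {port}: {state}")
```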

Cache Cleanup

The startup script automatically clears caches [6]:

Python __pycache__ directories
Next.js .next build artifacts
npm cache

Contributing

Fork the repository
Create a feature branch
Make changes with tests
Submit a pull request

License

This project is licensed under the MIT License.

Notes

This README is generated based on the current codebase structure and configuration. For the most up-to-date information, refer to the individual component documentation and inline code comments.

Citations

File: backend/pyproject.toml (L10-32)

dependencies = [
    "aiofiles>=24.1.0",
    "aiohttp>=3.13.0",
    "aiosqlite>=0.21.0",
    "alembic>=1.16.5",
    "docling>=2.58.0",
    "fastapi>=0.118.0",
    "fitz>=0.0.1.dev2",
    "hatchling>=1.27.0",
    "marker-pdf>=1.10.1",
    "onnxruntime>=1.23.2",
    "opencv-python-headless>=4.11.0.86",
    "pikepdf>=10.0.0",
    "pydantic>=2.11.9",
    "pydantic-settings>=2.11.0",
    "pymupdf>=1.26.5",
    "python-multipart>=0.0.20",
    "rapidocr>=3.4.2",
    "setuptools>=80.9.0",
    "spacy>=3.7.0",
    "sqlalchemy>=2.0.43",
    "uvicorn>=0.37.0",
]

File: start-dev.sh (L1-90)

#!/bin/bash

# ==========================================
# 🚀 Scipher Development Startup Script
# Starts both FastAPI (backend) and Next.js (Web)
# ==========================================

set -e  # Exit immediately on error

echo "🚀 Starting Scipher Development Environment"
echo "=========================================="

# Ensure we're in the project root
if [ ! -d "backend" ] || [ ! -d "Web" ]; then
    echo "❌ Error: Please run this script from the scipher project root directory."
    exit 1
fi

# --- Cache Cleanup Function ---
clean_caches() {
    echo "🧹 Cleaning caches for a fresh start..."
    
    # Backend: Python caches
    echo "  📦 Clearing Python caches..."
    find backend -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true
    find backend -name "*.pyc" -delete 2>/dev/null || true
    find backend -name "*.pyo" -delete 2>/dev/null || true
    
    # Backend: App-specific caches (e.g., processed data dir if exists)
    if [ -d "backend/src/processed_data" ]; then  # Adjust path based on settings.PROCESSED_DATA_DIR
        echo "  📁 Clearing processed data cache..."
        rm -rf backend/processed/*.json 2>/dev/null || true
    fi
    
    # Frontend: Next.js and npm caches
    echo "  🖥️  Clearing Next.js and npm caches..."
    cd Web
    rm -rf .next 2>/dev/null || true
    rm -rf node_modules/.cache 2>/dev/null || true
    npm cache clean --force 2>/dev/null || true
    cd ..
    
    # General: Any temp dirs (optional: add more as needed)
    rm -rf /tmp/scipher_* 2>/dev/null || true  # Temp files if used
    
    echo "✅ Caches cleared!"
}

# --- Backend startup ---
start_backend() {
    echo "📡 Starting FastAPI Backend..."
    cd backend

    # Activate venv if it exists (uv-compatible)
    if [ -d ".venv" ]; then
        echo "📦 Activating virtual environment..."
        source .venv/bin/activate
    fi

    # Install dependencies (prefer uv if available)
    if command -v uv &> /dev/null; then
        echo "📥 Installing Python dependencies via uv..."
        uv sync
    elif [ -f "requirements.txt" ]; then
        echo "📥 Installing Python dependencies via pip..."
        pip install -r requirements.txt
    fi

    echo "🚀 Starting backend server at http://localhost:8080"
    PYTHONPATH=src uvicorn main:app --reload --host 0.0.0.0 --port 8080 &
    BACKEND_PID=$!
    cd ..
    echo "✅ Backend started (PID: $BACKEND_PID)"
}

# --- Web (Next.js) startup ---
start_web() {
    echo "🖥️  Starting Next.js Web Frontend..."
    cd Web

    if [ ! -d "node_modules" ]; then
        echo "📦 Installing Node.js dependencies..."
        npm install
    fi

    echo "🚀 Starting frontend server at http://localhost:3000"
    npm run dev &
    FRONTEND_PID=$!
    cd ..
    echo "✅ Web frontend started (PID: $FRONTEND_PID)"

File: backend/src/scipher/config.py (L6-31)

class Settings(BaseSettings):
    APP_NAME: str = "Scipher API"
    APP_VERSION: str = "1.0.0"
    DEBUG: bool = True
    
    HOST: str = "0.0.0.0"
    PORT: int = 8080
    
    DATABASE_URL: str = "sqlite+aiosqlite:///./scipher.db"
    DB_ECHO: bool = False
    
    UPLOAD_DIR: Path = Path("uploads")
    MAX_FILE_SIZE: int = 50 * 1024 * 1024  # 50MB
    ALLOWED_EXTENSIONS: Set[str] = {".pdf"}
    
    CORS_ORIGINS: list[str] = ["http://localhost:3000", "http://localhost:3001"]
    
    PROCESSING_TIMEOUT: int = 180
    
    PROCESSED_DATA_DIR: Path = Path("processed")
    TEMP_DIR: Path = Path("temp")
    
    model_config = SettingsConfigDict(
        env_file=".env",
        case_sensitive=True
    )

File: backend/src/scipher/core/document_processor.py (L303-319)

    async def summarize_document(self, doc_id: str, summarizer: Optional[DocumentSummarizer] = None) -> SummaryResult:
        """Generate summaries for a processed document."""
        markdown_text = self.load_markdown_file(doc_id)
        if not markdown_text:
            raise ProcessingException("No processed markdown available for summarization")

        summarizer = summarizer or self.summarizer
        loop = asyncio.get_event_loop()
        try:
            result = await loop.run_in_executor(None, summarizer.summarize, markdown_text)
        except ValueError as exc:
            raise ProcessingException(str(exc)) from exc
        return result

# Singleton instance
document_processor = DocumentProcessor()
