InsightNet is a local research/knowledge discovery engine that ingests documents (PDF/TXT/DOCX), extracts entities and topics, builds an entity co-occurrence graph, and provides an interactive visualization and simple analytics via a React frontend and a FastAPI backend.
This README explains what the project does, how it works, how to run it locally, and what technologies it uses.
- Upload documents to the backend. The pipeline extracts text and metadata.
- An NLP pipeline extracts named entities and co-occurrences and scores them (confidence, corroboration, NER score, source credibility).
- Topic modelling groups documents into topics/clusters.
- A graph builder constructs an entity co-occurrence graph (NetworkX) and serves nodes/edges to the frontend.
- The React frontend (Vite) renders the interactive knowledge canvas (D3), plus summary panels and a gap detector.
- Backend: Python, FastAPI, SQLAlchemy, SQLite
- NLP: spaCy (en_core_web_sm), sentence-transformers, BERTopic (topic modelling)
- Graphs & scoring: NetworkX, custom `ConfidenceScorer`, `GraphBuilder`, and `GapDetector`
- Frontend: React + Vite, D3 for visualization
- Storage: SQLite database at `data/knowledge_engine.db`
- Python 3.10+ (3.11/3.12 also work)
- Node.js 14+ and npm or yarn
- Enough free memory to process large documents
Install system-level dependencies if needed (Linux example):
```bash
# (Debian/Ubuntu) optional utilities
sudo apt update && sudo apt install -y build-essential libxml2-dev libxslt1-dev
```

- Open a terminal and change into the `backend` folder:

```bash
cd backend
```

- Create and activate a virtual environment and install the Python dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- Install the spaCy language model (required):

```bash
python -m spacy download en_core_web_sm
```

- Start the backend server (development):

```bash
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Notes:

- The backend expects to run with its working directory set to `backend/` so imports and relative paths resolve correctly.
- The SQLite DB file is `data/knowledge_engine.db` at the repo root.
- In a separate terminal:

```bash
cd frontend
npm install
# run development server
npm run dev
# or build for production
npm run build
```

- Open the URL shown by Vite (usually `http://localhost:5173`) to interact with the UI.
- `POST /upload` — upload a document file (multipart/form-data). The server extracts text and processes the document.
- `GET /documents` — list stored documents
- `GET /documents/{id}` — fetch a document's metadata
- `DELETE /documents/{id}` — remove a document
- `GET /graph` — returns `nodes` and `edges` for visualization; nodes include dynamic summary fields: `confidence`, `corroboration`, `ner_score`, `source_credibility`, `source_docs`.
- `GET /gaps` — returns detected research gaps with `gap_score` (capped at 0..1; the UI displays it as a percentage)
- `GET /topics` — topic clusters
- `GET /graph/timeline?year_max=YYYY` — timeline-filtered graph
Example quick test (after the backend is running):

```bash
curl http://localhost:8000/graph | jq '.nodes[0]'
curl http://localhost:8000/gaps | jq '.[0]'
```
1. Document ingestion
   - A user uploads a file via the frontend or `POST /upload`; `DataExtractor` extracts the text and basic metadata (filename, year).
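As an illustration, the metadata side of this step might look like the sketch below. The function name and the year-from-filename heuristic are hypothetical; the real `DataExtractor` also pulls the full text per format (PDF/TXT/DOCX).

```python
import re
from pathlib import Path

def extract_metadata(path: str) -> dict:
    """Pull basic metadata (filename, year) from a document path.

    Hypothetical helper: guesses the year from the first 4-digit
    19xx/20xx number in the filename.
    """
    name = Path(path).name
    m = re.search(r"(19|20)\d{2}", name)
    return {"filename": name, "year": int(m.group()) if m else None}

print(extract_metadata("papers/smith_2021_graphs.pdf"))
# {'filename': 'smith_2021_graphs.pdf', 'year': 2021}
```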
2. NLP processing
   - `NLPProcessor` (spaCy) tokenizes the text and extracts named entities (PERSON, ORG, GPE, PRODUCT, etc.).
   - Entities that co-occur inside the same sentence are recorded as entity pairs.
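The pair-detection step can be sketched in plain Python. The data shape (one entity-name list per sentence) and the function name are illustrative; in the real pipeline, sentences and entities come from the spaCy pass.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(sentence_entities):
    """Count unordered entity pairs that appear in the same sentence.

    `sentence_entities`: one list of entity names per sentence,
    as produced upstream by the NER pass.
    """
    pairs = Counter()
    for ents in sentence_entities:
        # de-duplicate within a sentence; sort for a canonical pair key
        for a, b in combinations(sorted(set(ents)), 2):
            pairs[(a, b)] += 1
    return pairs

sentences = [
    ["OpenAI", "San Francisco"],
    ["OpenAI", "Microsoft", "San Francisco"],
    ["Microsoft"],
]
pairs = cooccurrence_pairs(sentences)
print(pairs[("OpenAI", "San Francisco")])  # 2
```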
3. Scoring & corroboration
   - `ConfidenceScorer` computes a numeric confidence for entities and entity pairs from label priors (NER label quality), corroboration (how many documents mention them), and a `source_credibility` factor.
   - Corroboration is a simple function of frequency (e.g., `min(freq / 10, 1.0)`).
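A minimal sketch of this scoring, with hypothetical label priors and blend weights; only the `min(freq/10, 1.0)` corroboration rule comes from the description above.

```python
# Hypothetical priors: rough NER reliability per entity label.
LABEL_PRIORS = {"PERSON": 0.9, "ORG": 0.85, "GPE": 0.8, "PRODUCT": 0.7}

def corroboration(doc_freq: int) -> float:
    """Frequency-based corroboration, capped at 1.0."""
    return min(doc_freq / 10.0, 1.0)

def confidence(label: str, doc_freq: int, source_credibility: float) -> float:
    """Blend NER prior, corroboration, and source credibility.

    The 0.5/0.3/0.2 weights are illustrative, not the real ones.
    """
    prior = LABEL_PRIORS.get(label, 0.5)
    return round(0.5 * prior + 0.3 * corroboration(doc_freq)
                 + 0.2 * source_credibility, 3)

print(confidence("ORG", 4, 0.8))  # 0.5*0.85 + 0.3*0.4 + 0.2*0.8 = 0.705
```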
4. Topic modelling & summarization
   - `TopicModeler` groups documents with BERTopic; topics and keywords are stored.
   - `Summarizer` produces short summaries for documents or entity neighborhoods.
5. Graph construction
   - `GraphBuilder` collects edges (entity co-occurrences) and constructs an in-memory NetworkX graph.
   - Nodes are annotated with attributes (type, first_seen_year, avg_confidence). When `GET /graph` is called, the backend also computes and returns dynamic summary fields per node (confidence, corroboration, ner_score, source_credibility).
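The edge-accumulation step can be sketched with NetworkX directly (the edge-record shape and function name are illustrative):

```python
import networkx as nx

def build_graph(edges):
    """Build an entity co-occurrence graph from (a, b, weight) records,
    summing weights for repeated pairs."""
    G = nx.Graph()
    for a, b, w in edges:
        if G.has_edge(a, b):
            G[a][b]["weight"] += w  # accumulate co-occurrence counts
        else:
            G.add_edge(a, b, weight=w)
    return G

G = build_graph([("OpenAI", "Microsoft", 3),
                 ("OpenAI", "San Francisco", 1),
                 ("OpenAI", "Microsoft", 2)])
print(G["OpenAI"]["Microsoft"]["weight"])  # 5
```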
6. Persistence
   - Documents, entities, topics, and edges are persisted in the SQLite database so the graph can be reconstructed on restart.
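An illustrative sketch of the persistence layer using the stdlib `sqlite3` module; the table and column names here are hypothetical (see the actual models in `backend/`), and an in-memory DB stands in for `data/knowledge_engine.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real app uses data/knowledge_engine.db
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT, year INTEGER);
CREATE TABLE entities  (id INTEGER PRIMARY KEY, name TEXT UNIQUE, label TEXT);
CREATE TABLE edges     (source INTEGER, target INTEGER, weight INTEGER,
                        FOREIGN KEY(source) REFERENCES entities(id),
                        FOREIGN KEY(target) REFERENCES entities(id));
""")
conn.execute("INSERT INTO entities (name, label) VALUES (?, ?)", ("OpenAI", "ORG"))
row = conn.execute("SELECT label FROM entities WHERE name = ?", ("OpenAI",)).fetchone()
print(row[0])  # ORG
```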
7. Frontend
   - The React app fetches `/graph` and renders nodes/edges with D3.
   - Clicking a node shows the Entity Summary (confidence etc.); the `GapDetector` highlights high-scoring entity pairs that never co-occur.
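The gap-detection idea (score pairs of well-supported entities that share no edge) can be sketched as follows; the threshold, the averaging formula, and all names are illustrative, not the real `GapDetector` internals.

```python
from itertools import combinations

def detect_gaps(node_confidence, existing_edges, threshold=0.7):
    """Flag high-confidence entity pairs with no co-occurrence edge.

    gap_score here is the pair's mean confidence, capped at 1.0
    (matching the 0..1 range returned by GET /gaps).
    """
    edges = {frozenset(e) for e in existing_edges}
    gaps = []
    for a, b in combinations(sorted(node_confidence), 2):
        if frozenset((a, b)) in edges:
            continue  # the pair already co-occurs somewhere
        score = min((node_confidence[a] + node_confidence[b]) / 2, 1.0)
        if score >= threshold:
            gaps.append({"pair": (a, b), "gap_score": round(score, 3)})
    return sorted(gaps, key=lambda g: -g["gap_score"])

nodes = {"CRISPR": 0.9, "GraphQL": 0.8, "BERTopic": 0.6}
result = detect_gaps(nodes, [("CRISPR", "BERTopic")])
print(result[0])  # {'pair': ('CRISPR', 'GraphQL'), 'gap_score': 0.85}
```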
- If nodes are missing summary fields: make sure the backend is running and that the graph was rebuilt after uploading (the backend rebuilds automatically after `POST /upload`).
- If spaCy complains that the model is not found: run `python -m spacy download en_core_web_sm` inside your backend venv.
- If the frontend doesn't reflect changes: rebuild (`npm run build`), or run the dev server and hard-refresh the browser.