
InsightNet — AI Knowledge Discovery Engine

InsightNet is a local research/knowledge discovery engine that ingests documents (PDF/TXT/DOCX), extracts entities and topics, builds an entity co-occurrence graph, and provides an interactive visualization and simple analytics via a React frontend and a FastAPI backend.

This README explains what the project does, how it works, how to run it locally, and what technologies it uses.


Project overview

  • Upload documents to the backend. The pipeline extracts text and metadata.
  • An NLP pipeline extracts named entities and co-occurrences and scores them (confidence, corroboration, NER score, source credibility).
  • Topic modelling groups documents into topics/clusters.
  • A graph builder constructs an entity co-occurrence graph (NetworkX) and serves nodes/edges to the frontend.
  • The React frontend (Vite) renders the interactive knowledge canvas (D3), plus summary panels and a gap detector.

Tech stack

  • Backend: Python, FastAPI, SQLAlchemy, SQLite
  • NLP: spaCy (en_core_web_sm), sentence-transformers, BERTopic (topic modelling)
  • Graphs & scoring: NetworkX, custom ConfidenceScorer, GraphBuilder, GapDetector
  • Frontend: React + Vite, D3 for visualization
  • Storage: SQLite database at data/knowledge_engine.db

Prerequisites

  • Python 3.10+ (3.11 and 3.12 also work)
  • Node.js (14+) and npm or yarn
  • System with enough memory if processing large docs

Install system-level dependencies if needed (Linux example):

# (Debian/Ubuntu) optional utilities
sudo apt update && sudo apt install -y build-essential libxml2-dev libxslt1-dev

Backend setup (recommended)

  1. Open a terminal, change into the backend folder:
cd backend
  2. Create and activate a virtual environment and install Python deps:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  3. Install the spaCy language model (required):
python -m spacy download en_core_web_sm
  4. Start the backend server (development):
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Notes:

  • The backend expects to run with its working directory set to backend/ so imports and relative paths resolve correctly.
  • The SQLite DB file is data/knowledge_engine.db at the repo root.

Frontend setup

  1. In a separate terminal:
cd frontend
npm install
# run development server
npm run dev
# or build for production
npm run build
  2. Open the URL shown by Vite (usually http://localhost:5173) to interact with the UI.

API endpoints (important)

  • POST /upload — upload a document file (multipart/form-data). The server extracts text and processes the document.
  • GET /documents — list stored documents
  • GET /documents/{id} — fetch a document's metadata
  • DELETE /documents/{id} — remove a document
  • GET /graph — returns nodes and edges for visualization; nodes include dynamic summary fields: confidence, corroboration, ner_score, source_credibility, source_docs.
  • GET /gaps — returns detected research gaps with gap_score (capped to the 0–1 range; the UI displays it as a percentage)
  • GET /topics — topic clusters
  • GET /graph/timeline?year_max=YYYY — timeline-filtered graph

Example quick test (after backend is running):

curl http://localhost:8000/graph | jq .nodes[0]
curl http://localhost:8000/gaps | jq .[0]
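
For reference, here is a sketch of consuming the /graph payload in Python. The field names follow the endpoint description above; the sample response and the helper function are illustrative, not part of the project:

```python
# Illustrative /graph payload; field names follow the endpoint docs,
# the values are hypothetical.
sample = {
    "nodes": [
        {"id": "OpenAI", "type": "ORG", "confidence": 0.82,
         "corroboration": 0.3, "ner_score": 0.9,
         "source_credibility": 0.7, "source_docs": [1, 4]},
    ],
    "edges": [
        {"source": "OpenAI", "target": "GPT-4", "weight": 2},
    ],
}

def index_nodes(payload):
    """Map node id -> node dict for quick lookup when drawing edges."""
    return {n["id"]: n for n in payload["nodes"]}

nodes = index_nodes(sample)
print(nodes["OpenAI"]["confidence"])  # 0.82
```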

How it works (pipeline / flow)

  1. Document ingestion

    • User uploads a file via the frontend or POST /upload.
    • DataExtractor extracts text and basic metadata (filename, year).
  2. NLP processing

    • NLPProcessor (spaCy) tokenizes the text and extracts named entities (PERSON, ORG, GPE, PRODUCT, etc.).
    • Co-occurring entities inside the same sentences are detected as entity pairs.
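
Stripped of the spaCy specifics, the per-sentence pairing step can be sketched in a few lines (the function and variable names here are illustrative, not the project's API):

```python
# Given the entities found in each sentence, count sorted entity pairs.
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(sentence_entities):
    """sentence_entities: one list of entity names per sentence."""
    pairs = Counter()
    for ents in sentence_entities:
        # sort + dedupe so ("A", "B") and ("B", "A") count as one pair
        for a, b in combinations(sorted(set(ents)), 2):
            pairs[(a, b)] += 1
    return pairs

doc = [["Alice", "Acme Corp"], ["Acme Corp", "Paris", "Alice"]]
print(cooccurrence_pairs(doc))
# Counter({('Acme Corp', 'Alice'): 2, ('Acme Corp', 'Paris'): 1, ('Alice', 'Paris'): 1})
```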
  3. Scoring & corroboration

    • ConfidenceScorer computes a numeric confidence for entities and entity pairs using label priors (NER label quality), corroboration (how many docs mention them) and a source_credibility factor.
    • Corroboration is a simple function of frequency (e.g., min(freq/10, 1.0)).
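
A minimal sketch of this scoring step: only the min(freq/10, 1.0) corroboration formula comes from the project; the label priors and the weighting below are placeholder assumptions, not the real ConfidenceScorer values:

```python
# Hypothetical label priors (NER label quality); illustrative only.
LABEL_PRIORS = {"PERSON": 0.9, "ORG": 0.85, "GPE": 0.8, "PRODUCT": 0.7}

def corroboration(freq):
    # Formula from the README: saturates at 10 corroborating documents.
    return min(freq / 10, 1.0)

def confidence(label, freq, source_credibility):
    ner_score = LABEL_PRIORS.get(label, 0.5)
    # Placeholder weighting; the real scorer may combine these differently.
    return round(0.5 * ner_score
                 + 0.3 * corroboration(freq)
                 + 0.2 * source_credibility, 3)

print(confidence("ORG", freq=4, source_credibility=0.8))  # 0.705
```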
  4. Topic modelling & summarization

    • TopicModeler groups documents with BERTopic; topics and keywords are stored.
    • Summarizer produces short summaries for documents or entity neighborhoods.
  5. Graph construction

    • GraphBuilder collects edges (entity co-occurrences) and constructs an in-memory NetworkX graph.
    • Nodes are annotated with attributes (type, first_seen_year, avg_confidence). When GET /graph is called the backend also computes and returns dynamic summary fields per node (confidence, corroboration, ner_score, source_credibility).
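
The graph-building step above can be sketched with NetworkX; the attribute names follow this README, but the real GraphBuilder may differ:

```python
import networkx as nx

def build_graph(pair_counts, entity_meta):
    """pair_counts: {(entity_a, entity_b): co-occurrence count}."""
    G = nx.Graph()
    for (a, b), count in pair_counts.items():
        G.add_edge(a, b, weight=count)
    for name, meta in entity_meta.items():
        if name in G:
            G.nodes[name].update(meta)  # e.g. type, first_seen_year
    return G

G = build_graph(
    {("Alice", "Acme Corp"): 2, ("Acme Corp", "Paris"): 1},
    {"Alice": {"type": "PERSON", "first_seen_year": 2021}},
)
print(G.number_of_nodes(), G["Alice"]["Acme Corp"]["weight"])  # 3 2
```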
  6. Persistence

    • Documents, entities, topics, and edges are persisted in the SQLite database so the graph can be reconstructed on restart.
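
The project persists through SQLAlchemy; the same round-trip idea can be sketched with the stdlib sqlite3 module (the table and column names are illustrative, not the project's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real app uses data/knowledge_engine.db
conn.execute("""CREATE TABLE IF NOT EXISTS edges (
    source TEXT, target TEXT, weight INTEGER,
    PRIMARY KEY (source, target))""")

def save_edges(conn, pair_counts):
    conn.executemany(
        "INSERT OR REPLACE INTO edges VALUES (?, ?, ?)",
        [(a, b, w) for (a, b), w in pair_counts.items()])
    conn.commit()

def load_edges(conn):
    """Reload edges so the graph can be rebuilt on restart."""
    return {(a, b): w for a, b, w in conn.execute("SELECT * FROM edges")}

save_edges(conn, {("Alice", "Acme Corp"): 2})
print(load_edges(conn))  # {('Alice', 'Acme Corp'): 2}
```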
  7. Frontend

    • React app fetches /graph and renders nodes/edges with D3.
    • Clicking a node shows the Entity Summary (confidence etc.). The GapDetector highlights high-scoring entity pairs that never co-occur.

Troubleshooting & tips

  • If nodes show missing summary fields: ensure the backend is running and that the graph was rebuilt after uploading (the backend rebuilds it automatically after POST /upload).
  • If spaCy complains that the model is not found: run python -m spacy download en_core_web_sm inside the backend venv.
  • If the frontend doesn't reflect changes: rebuild (npm run build) or run the dev server and hard-refresh the browser.
