DataMind — AI-Assisted Data Research & Development System

A fully local, AI-powered Exploratory Data Analysis platform that transforms raw multi-table datasets into structured research, automated roadmaps, and collaborative ML project plans — no cloud API required.

Overview

DataMind is a production-grade, AI-augmented EDA tool built on Streamlit. It solves a fundamental problem in data science projects: bridging the gap between raw datasets and a structured, decision-ready project plan.

Traditional EDA tools stop at statistics and charts. DataMind goes several steps further — it understands your data, infers the ML task type, detects anomalies, curates relevant research, generates a phased development roadmap, and collaborates with you through a local LLM chat assistant — all without sending a single byte to an external server.

The system is designed around a 10-section navigation pipeline that guides a data scientist from raw dataset upload through profiling, quality assurance, context understanding, research discovery, project planning, and finally, interactive AI collaboration.

Features

Multi-Table ZIP Dataset Ingestion — Upload any ZIP archive containing one or more CSV files; all tables are loaded and co-indexed automatically
Dataset Overview — Row/column counts, memory usage, numeric vs. categorical column breakdown, and missing-value summary per table
Schema & Relationship Detection — Auto-detects shared columns across tables, infers foreign key relationships using primary-key uniqueness and value-overlap analysis (≥95% threshold)
Relationship Graph Visualisation — Generates a directed NetworkX graph of FK relationships and saves a publication-quality PNG
Advanced Data Quality & Anomaly Detection — Seven independent checks per column/table: IQR outliers, Z-score outliers, impossible negatives, round-number bias, constant columns, near-unique categoricals, and missing value patterns; each issue is severity-scored (high / medium / low) and contributes to a 0–100 quality score
Context & Task Type Inference — Heuristic signal matching (60+ regex patterns) plus LLM-based JSON-structured analysis to identify task type (Classification, Regression, Time-Series, Clustering), domain (Healthcare, Finance, Transport, IoT, E-Commerce, Human Activity), and target variable candidates
Research Discovery — Offline curated resource library (papers, libraries, Kaggle notebooks) matched to the inferred task type, enriched with LLM-generated project-specific relevance explanations
Research Note Submission & Feedback — Users write research summaries; the LLM reviews their understanding and provides targeted feedback
AI-Generated 6-Phase Project Roadmap — A structured development plan spanning Data Quality & Preprocessing → Feature Engineering → Model Development → System Design → UI & Integration → Monitoring, with LLM-generated phase-specific guidance tailored to the actual dataset and task
Progress Tracking — Interactive step completion with a visual progress bar persisted across sessions
Statistical Insights — Pearson correlation detection (|r| > 0.65), IQR-based outlier summaries, distribution skewness detection, and coefficient-of-variation checks
Time Column Detection — Keyword-based and parse-rate-based detection of datetime columns with confidence scoring
Feature Suggestion Engine — Domain-aware formula-based feature engineering suggestions (trip duration, fare per minute, average speed, earnings velocity, rolling means)
Distribution Visualisations — KDE-overlaid histograms (Seaborn) for every numeric column, saved as 300 DPI PNGs
ydata-profiling HTML Reports — Full exploratory profile reports per table with Pearson correlation, interactions, and missing value diagrams (configurable for performance)
Collaborative AI Chat Assistant — Multi-turn conversation backed by a local Ollama model with dataset-aware and roadmap-aware system prompting, quick-start buttons, and scrollable chat UI
Ollama Health Monitoring — Live connection status indicator and local model picker in the sidebar

System Architecture

DataMind follows a clean layered architecture with strict separation between UI, processing, inference, and I/O concerns.

┌──────────────────────────────────────────────────────────────────────┐
│                     USER INTERFACE LAYER                             │
│                  Streamlit (app.py — 1,021 lines)                    │
│  Sidebar Navigation · Section Rendering · Session State Management   │
│  Custom CSS Design System (DM Mono · Syne · Inter · Dark Theme)      │
└─────────────────────────────┬────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌─────────────────┐  ┌────────────────┐  ┌──────────────────────────┐
│  DATA LOADING   │  │  STATISTICAL   │  │    AI / LLM LAYER        │
│    LAYER        │  │  ANALYSIS      │  │                          │
│                 │  │    LAYER       │  │  llm_client.py           │
│  loader.py      │  │                │  │  ├── ask() / ask_json()  │
│  ├── ZIP unpack │  │  stats_        │  │  ├── chat_turn()         │
│  ├── CSV parse  │  │  analyzer.py   │  │  ├── health_check()      │
│  └── name clean │  │  insight_      │  │  └── list_models()       │
│                 │  │  generator.py  │  │                          │
└─────────────────┘  │  time_series_  │  │  context_analyzer.py    │
                     │  detector.py   │  │  research_agent.py      │
                     │  feature_      │  │  project_planner.py     │
                     │  suggester.py  │  │  collaborative_agent.py │
                     └────────────────┘  └──────────────────────────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              ▼
              ┌───────────────────────────────┐
              │   SCHEMA & QUALITY LAYER      │
              │                               │
              │  schema_detector.py           │
              │  fk_detector.py               │
              │  anomaly_detector.py          │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │  VISUALISATION & REPORT LAYER │
              │                               │
              │  visualizer.py (Seaborn)      │
              │  relationship_graph.py (NetworkX)│
              │  profiling_reports.py (ydata) │
              │  multi_table_report.py (Jinja2)│
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │         OUTPUT LAYER          │
              │                               │
              │  outputs/plots/*.png          │
              │  outputs/profiles/*.html      │
              │  outputs/join_graph.png       │
              │  outputs/EDA_REPORT.html      │
              │  temp_data/ (uploaded ZIPs)   │
              └───────────────────────────────┘

Key Architectural Decisions

Decision	Rationale
Fully local LLM via Ollama	Zero cost, zero data leakage, works offline
Heuristic-first, LLM-second	Heuristics provide instant results; LLM enriches with nuance without being a hard dependency
Streamlit Session State	All computed artefacts (roadmap, anomaly report, resources, chat history) are cached in `st.session_state` to avoid redundant recomputation
ZIP-based multi-table input	Enables real-world multi-file dataset bundles in a single upload action
Modular Python files	Each analytical concern is isolated into its own module, making the codebase independently testable

Project Structure

EDA/
│
├── app.py                    # Main Streamlit application (entry point, UI, navigation)
│
├── loader.py                 # ZIP extraction and CSV loading pipeline
├── stats_analyzer.py         # Per-table statistics (rows, cols, memory, missing %)
├── schema_detector.py        # Shared-column schema detection across tables
├── fk_detector.py            # Foreign key relationship inference (primary key + overlap)
├── time_series_detector.py   # Datetime column detection (keyword + parse-rate)
├── feature_suggester.py      # Domain-aware feature engineering formula suggestions
├── insight_generator.py      # Correlation, outlier, skew, and variability insights
├── visualizer.py             # Seaborn KDE+histogram plots saved as PNG
├── relationship_graph.py     # NetworkX directed graph + Matplotlib visualisation
├── profiling_reports.py      # ydata-profiling HTML report generation
├── multi_table_report.py     # Jinja2 HTML summary report (multi-table overview)
│
├── anomaly_detector.py       # 7-check data quality engine with 0-100 quality scoring
├── context_analyzer.py       # Task type + domain inference (heuristic + LLM JSON)
├── research_agent.py         # Curated offline resource library + LLM relevance enrichment
├── project_planner.py        # 6-phase static roadmap scaffold + LLM phase guidance
├── collaborative_agent.py    # Multi-turn LLM chat with dataset-aware system prompting
├── llm_client.py             # Ollama REST client (ask, chat_turn, ask_json, health_check)
│
├── outputs/                  # Auto-generated output artefacts
│   ├── plots/                #     Distribution histograms per numeric column
│   ├── profiles/             #     ydata-profiling HTML reports per table
│   ├── join_graph.png        #     Table relationship graph image
│   └── EDA_REPORT.html       #     Consolidated multi-table HTML summary
│
└── temp_data/                # Temporary storage for uploaded ZIP files

Module Descriptions

Module	Responsibility
`app.py`	Orchestrates all sections, manages Streamlit session state, renders custom CSS design system
`loader.py`	Opens ZIP archives, discovers all CSV files, strips path prefixes for clean table names
`stats_analyzer.py`	Computes row counts, column types, memory footprint, missing value percentages, and descriptive statistics
`schema_detector.py`	Builds a column→tables inverted index to find columns shared across multiple tables
`fk_detector.py`	Applies primary-key uniqueness test + ≥95% value-overlap threshold to infer FK relationships
`anomaly_detector.py`	7 independent quality checks with per-severity deduction scoring system
`context_analyzer.py`	Two-stage inference: 60+ regex heuristics for instant results, then LLM for structured JSON enrichment
`research_agent.py`	Offline curated resource library across 5 task types; LLM generates project-specific "why relevant" captions
`project_planner.py`	6-phase, 18-step static roadmap with three step types (User Decision, Collaborative, Automated); LLM injects tailored phase guidance
`collaborative_agent.py`	Builds dataset-aware + roadmap-aware system prompts; wraps multi-turn `chat_turn()` for the AI Assistant section
`llm_client.py`	Thin HTTP client over Ollama's REST API — `ask()`, `chat_turn()`, `ask_json()`, health check, model listing
`insight_generator.py`	Pearson correlation (
`time_series_detector.py`	Keyword matching + `pd.to_datetime` parse-rate (>70%) for datetime column detection
`feature_suggester.py`	Pattern-based domain feature ideas: trip duration, fare/min, average speed, earnings velocity, rolling means
`visualizer.py`	Seaborn `histplot` + KDE for every numeric column, 300 DPI output
`relationship_graph.py`	NetworkX `DiGraph` of FK relationships with Spring Layout and annotated edge labels
`profiling_reports.py`	`ydata_profiling.ProfileReport` with Pearson-only correlation and performance optimisations
`multi_table_report.py`	Jinja2-templated HTML report combining stats, schema, and generated plots

How It Works — Step-by-Step Pipeline

1. APP LAUNCH
   └── Streamlit initialises session state for all computed artefacts
   └── Sidebar: Ollama health check → model picker → section navigation

2. DATASET UPLOAD
   └── User uploads a ZIP file via st.file_uploader
   └── loader.py: extracts ZIP → discovers CSVs → parses with pandas → returns {table_name: DataFrame}
   └── All tables stored in st.session_state.dfs

3. OVERVIEW (Section 1)
   └── stats_analyzer.py: computes per-table statistics
   └── UI: metric cards (tables, rows, columns) + per-table expanders

4. SCHEMA ANALYSIS (Section 2)
   └── schema_detector.py: builds column → tables map
   └── fk_detector.py: tests uniqueness + overlap for FK inference
   └── relationship_graph.py: optional NetworkX graph PNG

5. DATA QUALITY (Section 3)
   └── anomaly_detector.py: runs 7 checks across all numeric + categorical columns
   └── quality_score(): computes 0-100 score per table (deductions by severity / col count)
   └── UI: severity-badged issues, per-severity metric counters, affected row indices

6. CONTEXT INFERENCE (Section 4)
   └── User enters optional problem statement
   └── context_analyzer.py:
       ├── Heuristic: 60+ regex patterns → task type + domain + target candidates
       └── LLM: ask_json() → structured JSON with approach, observations, caveats

7. RESEARCH DISCOVERY (Section 5)
   └── research_agent.py: selects resources from offline library by task type
   └── LLM enriches each resource with a project-specific "why relevant" sentence
   └── User writes research notes → LLM reviews and provides feedback

8. ROADMAP GENERATION (Section 6)
   └── project_planner.py: deep-copies 6-phase static scaffold
   └── LLM injects 2-sentence tailored guidance per phase
   └── User marks steps done → progress bar updates in sidebar

9. INSIGHTS (Section 7)
   └── insight_generator.py: correlation, outlier, skew, variability analysis
   └── time_series_detector.py: datetime column detection
   └── feature_suggester.py: domain-aware formula suggestions

10. VISUALISATIONS (Section 8)
    └── visualizer.py: Seaborn histplot + KDE per numeric column
    └── Output: outputs/plots/{table}_{col}.png

11. PROFILING REPORTS (Section 9)
    └── profiling_reports.py: ydata-profiling HTML per table
    └── Output: outputs/profiles/{table}_profile.html

12. AI ASSISTANT (Section 10)
    └── collaborative_agent.py builds system prompt with dataset schema + context + roadmap progress
    └── chat_turn(): full multi-turn conversation via Ollama /api/chat
    └── Quick-start prompts for common data science questions

Technologies Used

Category	Technology
UI Framework	Streamlit
LLM Runtime	Ollama (local, OpenAI-compatible REST API)
LLM Models	llama3.2, llama3.1, Mistral, Gemma3 (user-selectable)
Data Processing	pandas, NumPy
Statistical Analysis	pandas `.corr()`, `.describe()`, `.skew()`
Visualisation	Matplotlib, Seaborn
Graph Visualisation	NetworkX
Profiling	ydata-profiling (ProfileReport)
HTML Templating	Jinja2
HTTP Client	Python standard library `urllib`
Typography	Google Fonts — Inter, Syne, DM Mono

Installation

Prerequisites

Python 3.10 or higher
Ollama installed and running locally

1. Clone the Repository

git clone https://github.com/<your-username>/datamind-eda.git
cd datamind-eda

2. Install Python Dependencies

pip install -r requirements.txt

3. Start Ollama and Pull a Model

# Start the Ollama server
ollama serve

# In a separate terminal, pull a model (choose one)
ollama pull llama3.2      # Recommended — fast and capable
ollama pull mistral       # Alternative
ollama pull gemma3        # Alternative

4. Launch DataMind

streamlit run app.py

The app will open at http://localhost:8501 in your browser.

Dependencies

Package	Purpose
`streamlit`	Web UI framework
`pandas`	DataFrame manipulation
`numpy`	Numerical operations
`matplotlib`	Plot rendering
`seaborn`	Statistical visualisations
`networkx`	Graph construction and layout
`ydata-profiling`	Automated HTML profiling reports
`jinja2`	HTML report templating

Note: No external LLM API keys or cloud services are required. All AI features run entirely through the local Ollama runtime.

Usage Guide

Step 1 — Prepare Your Dataset

Package your CSV files into a single ZIP archive. Files may be in subdirectories; DataMind will discover all CSVs automatically and use the filename (without extension) as the table name.

my_dataset.zip
├── trips.csv
├── sensors/
│   └── accelerometer_data.csv
└── earnings.csv

Step 2 — Upload and Explore

Open the app at http://localhost:8501
Upload your ZIP using the file uploader
Navigate through the sidebar sections in order

Step 3 — Recommended Workflow

Follow the navigation sections in the intended sequence for the best experience:

Step	Section	Action
1	Overview	Review table shapes and missing values
2	Schema	Understand cross-table relationships; generate graph
3	Data Quality	Run quality analysis; note high-severity issues
4	Context	Enter your project goal; analyse task type and domain
5	Research	Discover resources; write and submit research notes
6	Roadmap	Generate your personalised roadmap; track progress
7	Insights	Review correlations, feature suggestions, and time columns
8	Visualisations	Generate distribution plots for all numeric columns
9	Profiling Reports	Generate full HTML profiles for deep-dive analysis
10	AI Assistant	Chat with your local AI for guidance on any step

Step 4 — AI Assistant

Use the quick-start prompts or type any question:

"What preprocessing steps should I prioritise?"
"Which model should I start with for this task?"
"Explain the most critical anomalies found"
"What features are likely most predictive?"

Supported Data Types

DataMind is designed for tabular, structured CSV data across a wide range of domains:

Domain	Typical Signals Detected
Transport / Mobility	`trip`, `driver`, `fare`, `distance`, `speed`, `route`, `gps`
IoT / Sensor	`sensor`, `accelerometer`, `gyroscope`, `temperature`, `pressure`, `audio`
Finance	`price`, `stock`, `transaction`, `fraud`, `credit`, `loan`
Healthcare	`patient`, `diagnosis`, `medication`, `heart_rate`, `blood`
E-Commerce / Retail	`product`, `order`, `cart`, `purchase`, `sku`, `inventory`
Human Activity	`activity`, `step`, `walk`, `run`, `gesture`, `posture`, `imu`

Dataset constraints:

File format: CSV (inside a ZIP archive)
Multi-table support: unlimited tables per ZIP
Column types: numeric (float64, int64) and categorical (object)
Datetime columns: automatically detected and excluded from numeric analysis

Visualisations & Analysis Outputs

Automated Charts

Output	Description	Format
Distribution Histograms	KDE-overlaid histogram for every numeric column in every table	PNG (300 DPI)
Relationship Graph	Directed FK graph with edge labels showing join keys	PNG (300 DPI)

Analytical Reports

Output	Description	Format
ydata Profiling Report	Full per-table EDA report with correlations, distributions, missing value analysis	HTML
Multi-Table Summary	Jinja2 HTML report combining all tables, schema, and plots	HTML

In-App Analysis Outputs

Section	What is Shown
Data Quality	7-check anomaly report per table, severity badges, quality score 0–100, affected row indices
Insights	Strong correlations (
Context	Task type, domain, confidence level, target variable candidates with scoring rationale
Research	Curated papers, libraries, and notebooks with LLM-generated project-specific relevance
Roadmap	6-phase, 18-step interactive development plan with phase guidance and step completion tracking

Performance Notes

Lazy computation: Each section computes its analysis only when the user navigates to it; nothing runs at upload time except the CSV loading
Session state caching: All expensive computations (anomaly scan, context inference, roadmap generation, resource discovery) are stored in st.session_state and never recomputed unless explicitly triggered again via the action button
LLM graceful degradation: Every LLM-dependent feature has a heuristic fallback. If Ollama is offline, the app continues to function fully with heuristic-only results; the AI Assistant and Research Feedback sections display an actionable offline warning
ydata-profiling optimised configuration: Spearman, Kendall, Phi-K, and Cramér's correlations are disabled; interaction plots and missing value heatmaps are disabled to reduce report generation time on large datasets
Plot memory management: plt.close() is called after every plot to prevent memory accumulation during bulk visualisation generation
Anomaly detection scaling: Quality checks are normalised by column count (quality_score = max(0, 100 - raw_deductions / n_cols)) so scores remain comparable across tables of different widths

Example Workflow

Here is a typical end-to-end session for a driver behaviour classification project:

Upload — User uploads driver_pulse_dataset.zip containing trips.csv, accelerometer_data.csv, and earnings.csv
Overview — DataMind reports 3 tables, 250,000 total rows, 47 columns. trips.csv has 12% missing values in end_time
Schema — FK relationship detected: trips.driver_id → earnings.driver_id (98% overlap). Relationship graph generated
Data Quality — accelerometer_data.x_axis has 340 IQR outliers (high severity). trips.fare shows round-number bias (67% multiples of 5). Overall quality score: 74/100
Context — User enters: "Predict driver churn from trip and sensor data". DataMind infers: Binary Classification · Transport/Mobility domain · High confidence. Target candidate: churn_flag (binary, name match)
Research — XGBoost, SHAP, and Imbalanced-learn surfaces as top resources. User notes: "XGBoost handles missing values natively — useful for the end_time gaps". LLM feedback validates the insight and suggests SMOTE for class imbalance
Roadmap — 6-phase plan generated. LLM guidance for Phase 1: "Focus KNN imputation for end_time given its 12% missingness rate and the strong FK relationship to the earnings table..."
Insights — Strong correlation between trip_distance and fare (r=0.87), x_axis is right-skewed (skew=2.3). Feature suggestion: average_speed = distance / duration
Visualisations — 23 distribution plots generated for all numeric columns
AI Assistant — User asks: "Should I use SMOTE or class weights for the imbalance?" — AI responds with dataset-specific advice referencing the actual column names and row counts

Future Scope

Automated feature importance ranking before model selection using permutation importance or mutual information
Time-series specific EDA module with autocorrelation plots, seasonality decomposition, and stationarity tests
Custom anomaly threshold configuration per column type via the UI
Export roadmap to PDF or Notion for team sharing and project tracking integration
Expanded resource library with domain-specific Kaggle notebook links and GitHub repositories
Multi-user session support with persistent project workspaces saved to disk or a lightweight database
Streaming LLM responses in the AI Assistant using Ollama's streaming API for real-time token display
Automated data cleaning suggestions with one-click application of recommended preprocessing steps

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2025

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Author

Roopanshu Guota
ML Engineer Learner

Built DataMind as part of a data science research initiative focused on AI-augmented exploratory analysis and automated ML project planning.

DataMind — From raw data to research-ready, locally and intelligently.

If you find this project useful, please consider giving it a star!

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
__pycache__		__pycache__
outputs		outputs
test_data		test_data
README.md		README.md
anomaly_detector.py		anomaly_detector.py
app.py		app.py
collaborative_agent.py		collaborative_agent.py
context_analyzer.py		context_analyzer.py
feature_suggester.py		feature_suggester.py
fk_detector.py		fk_detector.py
insight_generator.py		insight_generator.py
llm_client.py		llm_client.py
loader.py		loader.py
multi_table_report.py		multi_table_report.py
profiling_reports.py		profiling_reports.py
project_planner.py		project_planner.py
relationship_graph.py		relationship_graph.py
research_agent.py		research_agent.py
schema_detector.py		schema_detector.py
stats_analyzer.py		stats_analyzer.py
time_series_detector.py		time_series_detector.py
visualizer.py		visualizer.py

Folders and files

Latest commit

History

Repository files navigation

DataMind — AI-Assisted Data Research & Development System

Overview

Features

System Architecture

Key Architectural Decisions

Project Structure

Module Descriptions

How It Works — Step-by-Step Pipeline

Technologies Used

Installation

Prerequisites

1. Clone the Repository

2. Install Python Dependencies

3. Start Ollama and Pull a Model

4. Launch DataMind

Dependencies

Usage Guide

Step 1 — Prepare Your Dataset

Step 2 — Upload and Explore

Step 3 — Recommended Workflow

Step 4 — AI Assistant

Supported Data Types

Visualisations & Analysis Outputs

Automated Charts

Analytical Reports

In-App Analysis Outputs

Performance Notes

Example Workflow

Future Scope

License

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages