A fully local, AI-powered Exploratory Data Analysis platform that transforms raw multi-table datasets into structured research, automated roadmaps, and collaborative ML project plans — no cloud API required.
DataMind is a production-grade, AI-augmented EDA tool built on Streamlit. It solves a fundamental problem in data science projects: bridging the gap between raw datasets and a structured, decision-ready project plan.
Traditional EDA tools stop at statistics and charts. DataMind goes several steps further — it understands your data, infers the ML task type, detects anomalies, curates relevant research, generates a phased development roadmap, and collaborates with you through a local LLM chat assistant — all without sending a single byte to an external server.
The system is designed around a 10-section navigation pipeline that guides a data scientist from raw dataset upload through profiling, quality assurance, context understanding, research discovery, project planning, and finally, interactive AI collaboration.
- Multi-Table ZIP Dataset Ingestion — Upload any ZIP archive containing one or more CSV files; all tables are loaded and co-indexed automatically
- Dataset Overview — Row/column counts, memory usage, numeric vs. categorical column breakdown, and missing-value summary per table
- Schema & Relationship Detection — Auto-detects shared columns across tables, infers foreign key relationships using primary-key uniqueness and value-overlap analysis (≥95% threshold)
- Relationship Graph Visualisation — Generates a directed NetworkX graph of FK relationships and saves a publication-quality PNG
- Advanced Data Quality & Anomaly Detection — Seven independent checks per column/table: IQR outliers, Z-score outliers, impossible negatives, round-number bias, constant columns, near-unique categoricals, and missing value patterns; each issue is severity-scored (high / medium / low) and contributes to a 0–100 quality score
- Context & Task Type Inference — Heuristic signal matching (60+ regex patterns) plus LLM-based JSON-structured analysis to identify task type (Classification, Regression, Time-Series, Clustering), domain (Healthcare, Finance, Transport, IoT, E-Commerce, Human Activity), and target variable candidates
- Research Discovery — Offline curated resource library (papers, libraries, Kaggle notebooks) matched to the inferred task type, enriched with LLM-generated project-specific relevance explanations
- Research Note Submission & Feedback — Users write research summaries; the LLM reviews their understanding and provides targeted feedback
- AI-Generated 6-Phase Project Roadmap — A structured development plan spanning Data Quality & Preprocessing → Feature Engineering → Model Development → System Design → UI & Integration → Monitoring, with LLM-generated phase-specific guidance tailored to the actual dataset and task
- Progress Tracking — Interactive step completion with a visual progress bar persisted across sessions
- Statistical Insights — Pearson correlation detection (|r| > 0.65), IQR-based outlier summaries, distribution skewness detection, and coefficient-of-variation checks
- Time Column Detection — Keyword-based and parse-rate-based detection of datetime columns with confidence scoring
- Feature Suggestion Engine — Domain-aware formula-based feature engineering suggestions (trip duration, fare per minute, average speed, earnings velocity, rolling means)
- Distribution Visualisations — KDE-overlaid histograms (Seaborn) for every numeric column, saved as 300 DPI PNGs
- ydata-profiling HTML Reports — Full exploratory profile reports per table with Pearson correlation, interactions, and missing value diagrams (configurable for performance)
- Collaborative AI Chat Assistant — Multi-turn conversation backed by a local Ollama model with dataset-aware and roadmap-aware system prompting, quick-start buttons, and scrollable chat UI
- Ollama Health Monitoring — Live connection status indicator and local model picker in the sidebar
DataMind follows a clean layered architecture with strict separation between UI, processing, inference, and I/O concerns.
┌──────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE LAYER │
│ Streamlit (app.py — 1,021 lines) │
│ Sidebar Navigation · Section Rendering · Session State Management │
│ Custom CSS Design System (DM Mono · Syne · Inter · Dark Theme) │
└─────────────────────────────┬────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌────────────────┐ ┌──────────────────────────┐
│ DATA LOADING │ │ STATISTICAL │ │ AI / LLM LAYER │
│ LAYER │ │ ANALYSIS │ │ │
│ │ │ LAYER │ │ llm_client.py │
│ loader.py │ │ │ │ ├── ask() / ask_json() │
│ ├── ZIP unpack │ │ stats_ │ │ ├── chat_turn() │
│ ├── CSV parse │ │ analyzer.py │ │ ├── health_check() │
│ └── name clean │ │ insight_ │ │ └── list_models() │
│ │ │ generator.py │ │ │
└─────────────────┘ │ time_series_ │ │ context_analyzer.py │
│ detector.py │ │ research_agent.py │
│ feature_ │ │ project_planner.py │
│ suggester.py │ │ collaborative_agent.py │
└────────────────┘ └──────────────────────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌───────────────────────────────┐
│ SCHEMA & QUALITY LAYER │
│ │
│ schema_detector.py │
│ fk_detector.py │
│ anomaly_detector.py │
└───────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ VISUALISATION & REPORT LAYER │
│ │
│ visualizer.py (Seaborn) │
│ relationship_graph.py (NetworkX)│
│ profiling_reports.py (ydata) │
│ multi_table_report.py (Jinja2)│
└───────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ OUTPUT LAYER │
│ │
│ outputs/plots/*.png │
│ outputs/profiles/*.html │
│ outputs/join_graph.png │
│ outputs/EDA_REPORT.html │
│ temp_data/ (uploaded ZIPs) │
└───────────────────────────────┘
| Decision | Rationale |
|---|---|
| Fully local LLM via Ollama | Zero cost, zero data leakage, works offline |
| Heuristic-first, LLM-second | Heuristics provide instant results; LLM enriches with nuance without being a hard dependency |
| Streamlit Session State | All computed artefacts (roadmap, anomaly report, resources, chat history) are cached in st.session_state to avoid redundant recomputation |
| ZIP-based multi-table input | Enables real-world multi-file dataset bundles in a single upload action |
| Modular Python files | Each analytical concern is isolated into its own module, making the codebase independently testable |
EDA/
│
├── app.py # Main Streamlit application (entry point, UI, navigation)
│
├── loader.py # ZIP extraction and CSV loading pipeline
├── stats_analyzer.py # Per-table statistics (rows, cols, memory, missing %)
├── schema_detector.py # Shared-column schema detection across tables
├── fk_detector.py # Foreign key relationship inference (primary key + overlap)
├── time_series_detector.py # Datetime column detection (keyword + parse-rate)
├── feature_suggester.py # Domain-aware feature engineering formula suggestions
├── insight_generator.py # Correlation, outlier, skew, and variability insights
├── visualizer.py # Seaborn KDE+histogram plots saved as PNG
├── relationship_graph.py # NetworkX directed graph + Matplotlib visualisation
├── profiling_reports.py # ydata-profiling HTML report generation
├── multi_table_report.py # Jinja2 HTML summary report (multi-table overview)
│
├── anomaly_detector.py # 7-check data quality engine with 0-100 quality scoring
├── context_analyzer.py # Task type + domain inference (heuristic + LLM JSON)
├── research_agent.py # Curated offline resource library + LLM relevance enrichment
├── project_planner.py # 6-phase static roadmap scaffold + LLM phase guidance
├── collaborative_agent.py # Multi-turn LLM chat with dataset-aware system prompting
├── llm_client.py # Ollama REST client (ask, chat_turn, ask_json, health_check)
│
├── outputs/ # Auto-generated output artefacts
│ ├── plots/ # Distribution histograms per numeric column
│ ├── profiles/ # ydata-profiling HTML reports per table
│ ├── join_graph.png # Table relationship graph image
│ └── EDA_REPORT.html # Consolidated multi-table HTML summary
│
└── temp_data/ # Temporary storage for uploaded ZIP files
| Module | Responsibility |
|---|---|
app.py |
Orchestrates all sections, manages Streamlit session state, renders custom CSS design system |
loader.py |
Opens ZIP archives, discovers all CSV files, strips path prefixes for clean table names |
stats_analyzer.py |
Computes row counts, column types, memory footprint, missing value percentages, and descriptive statistics |
schema_detector.py |
Builds a column→tables inverted index to find columns shared across multiple tables |
fk_detector.py |
Applies primary-key uniqueness test + ≥95% value-overlap threshold to infer FK relationships |
anomaly_detector.py |
7 independent quality checks with per-severity deduction scoring system |
context_analyzer.py |
Two-stage inference: 60+ regex heuristics for instant results, then LLM for structured JSON enrichment |
research_agent.py |
Offline curated resource library across 5 task types; LLM generates project-specific "why relevant" captions |
project_planner.py |
6-phase, 18-step static roadmap with three step types (User Decision, Collaborative, Automated); LLM injects tailored phase guidance |
collaborative_agent.py |
Builds dataset-aware + roadmap-aware system prompts; wraps multi-turn chat_turn() for the AI Assistant section |
llm_client.py |
Thin HTTP client over Ollama's REST API — ask(), chat_turn(), ask_json(), health check, model listing |
insight_generator.py |
Pearson correlation ( |
time_series_detector.py |
Keyword matching + pd.to_datetime parse-rate (>70%) for datetime column detection |
feature_suggester.py |
Pattern-based domain feature ideas: trip duration, fare/min, average speed, earnings velocity, rolling means |
visualizer.py |
Seaborn histplot + KDE for every numeric column, 300 DPI output |
relationship_graph.py |
NetworkX DiGraph of FK relationships with Spring Layout and annotated edge labels |
profiling_reports.py |
ydata_profiling.ProfileReport with Pearson-only correlation and performance optimisations |
multi_table_report.py |
Jinja2-templated HTML report combining stats, schema, and generated plots |
1. APP LAUNCH
└── Streamlit initialises session state for all computed artefacts
└── Sidebar: Ollama health check → model picker → section navigation
2. DATASET UPLOAD
└── User uploads a ZIP file via st.file_uploader
└── loader.py: extracts ZIP → discovers CSVs → parses with pandas → returns {table_name: DataFrame}
└── All tables stored in st.session_state.dfs
3. OVERVIEW (Section 1)
└── stats_analyzer.py: computes per-table statistics
└── UI: metric cards (tables, rows, columns) + per-table expanders
4. SCHEMA ANALYSIS (Section 2)
└── schema_detector.py: builds column → tables map
└── fk_detector.py: tests uniqueness + overlap for FK inference
└── relationship_graph.py: optional NetworkX graph PNG
5. DATA QUALITY (Section 3)
└── anomaly_detector.py: runs 7 checks across all numeric + categorical columns
└── quality_score(): computes 0-100 score per table (deductions by severity / col count)
└── UI: severity-badged issues, per-severity metric counters, affected row indices
6. CONTEXT INFERENCE (Section 4)
└── User enters optional problem statement
└── context_analyzer.py:
├── Heuristic: 60+ regex patterns → task type + domain + target candidates
└── LLM: ask_json() → structured JSON with approach, observations, caveats
7. RESEARCH DISCOVERY (Section 5)
└── research_agent.py: selects resources from offline library by task type
└── LLM enriches each resource with a project-specific "why relevant" sentence
└── User writes research notes → LLM reviews and provides feedback
8. ROADMAP GENERATION (Section 6)
└── project_planner.py: deep-copies 6-phase static scaffold
└── LLM injects 2-sentence tailored guidance per phase
└── User marks steps done → progress bar updates in sidebar
9. INSIGHTS (Section 7)
└── insight_generator.py: correlation, outlier, skew, variability analysis
└── time_series_detector.py: datetime column detection
└── feature_suggester.py: domain-aware formula suggestions
10. VISUALISATIONS (Section 8)
└── visualizer.py: Seaborn histplot + KDE per numeric column
└── Output: outputs/plots/{table}_{col}.png
11. PROFILING REPORTS (Section 9)
└── profiling_reports.py: ydata-profiling HTML per table
└── Output: outputs/profiles/{table}_profile.html
12. AI ASSISTANT (Section 10)
└── collaborative_agent.py builds system prompt with dataset schema + context + roadmap progress
└── chat_turn(): full multi-turn conversation via Ollama /api/chat
└── Quick-start prompts for common data science questions
| Category | Technology |
|---|---|
| UI Framework | Streamlit |
| LLM Runtime | Ollama (local, OpenAI-compatible REST API) |
| LLM Models | llama3.2, llama3.1, Mistral, Gemma3 (user-selectable) |
| Data Processing | pandas, NumPy |
| Statistical Analysis | pandas .corr(), .describe(), .skew() |
| Visualisation | Matplotlib, Seaborn |
| Graph Visualisation | NetworkX |
| Profiling | ydata-profiling (ProfileReport) |
| HTML Templating | Jinja2 |
| HTTP Client | Python standard library urllib |
| Typography | Google Fonts — Inter, Syne, DM Mono |
- Python 3.10 or higher
- Ollama installed and running locally
git clone https://github.com/<your-username>/datamind-eda.git
cd datamind-edapip install -r requirements.txt# Start the Ollama server
ollama serve
# In a separate terminal, pull a model (choose one)
ollama pull llama3.2 # Recommended — fast and capable
ollama pull mistral # Alternative
ollama pull gemma3 # Alternativestreamlit run app.pyThe app will open at http://localhost:8501 in your browser.
| Package | Purpose |
|---|---|
streamlit |
Web UI framework |
pandas |
DataFrame manipulation |
numpy |
Numerical operations |
matplotlib |
Plot rendering |
seaborn |
Statistical visualisations |
networkx |
Graph construction and layout |
ydata-profiling |
Automated HTML profiling reports |
jinja2 |
HTML report templating |
Note: No external LLM API keys or cloud services are required. All AI features run entirely through the local Ollama runtime.
Package your CSV files into a single ZIP archive. Files may be in subdirectories; DataMind will discover all CSVs automatically and use the filename (without extension) as the table name.
my_dataset.zip
├── trips.csv
├── sensors/
│ └── accelerometer_data.csv
└── earnings.csv
- Open the app at
http://localhost:8501 - Upload your ZIP using the file uploader
- Navigate through the sidebar sections in order
Follow the navigation sections in the intended sequence for the best experience:
| Step | Section | Action |
|---|---|---|
| 1 | Overview | Review table shapes and missing values |
| 2 | Schema | Understand cross-table relationships; generate graph |
| 3 | Data Quality | Run quality analysis; note high-severity issues |
| 4 | Context | Enter your project goal; analyse task type and domain |
| 5 | Research | Discover resources; write and submit research notes |
| 6 | Roadmap | Generate your personalised roadmap; track progress |
| 7 | Insights | Review correlations, feature suggestions, and time columns |
| 8 | Visualisations | Generate distribution plots for all numeric columns |
| 9 | Profiling Reports | Generate full HTML profiles for deep-dive analysis |
| 10 | AI Assistant | Chat with your local AI for guidance on any step |
Use the quick-start prompts or type any question:
- "What preprocessing steps should I prioritise?"
- "Which model should I start with for this task?"
- "Explain the most critical anomalies found"
- "What features are likely most predictive?"
DataMind is designed for tabular, structured CSV data across a wide range of domains:
| Domain | Typical Signals Detected |
|---|---|
| Transport / Mobility | trip, driver, fare, distance, speed, route, gps |
| IoT / Sensor | sensor, accelerometer, gyroscope, temperature, pressure, audio |
| Finance | price, stock, transaction, fraud, credit, loan |
| Healthcare | patient, diagnosis, medication, heart_rate, blood |
| E-Commerce / Retail | product, order, cart, purchase, sku, inventory |
| Human Activity | activity, step, walk, run, gesture, posture, imu |
Dataset constraints:
- File format: CSV (inside a ZIP archive)
- Multi-table support: unlimited tables per ZIP
- Column types: numeric (
float64,int64) and categorical (object) - Datetime columns: automatically detected and excluded from numeric analysis
| Output | Description | Format |
|---|---|---|
| Distribution Histograms | KDE-overlaid histogram for every numeric column in every table | PNG (300 DPI) |
| Relationship Graph | Directed FK graph with edge labels showing join keys | PNG (300 DPI) |
| Output | Description | Format |
|---|---|---|
| ydata Profiling Report | Full per-table EDA report with correlations, distributions, missing value analysis | HTML |
| Multi-Table Summary | Jinja2 HTML report combining all tables, schema, and plots | HTML |
| Section | What is Shown |
|---|---|
| Data Quality | 7-check anomaly report per table, severity badges, quality score 0–100, affected row indices |
| Insights | Strong correlations ( |
| Context | Task type, domain, confidence level, target variable candidates with scoring rationale |
| Research | Curated papers, libraries, and notebooks with LLM-generated project-specific relevance |
| Roadmap | 6-phase, 18-step interactive development plan with phase guidance and step completion tracking |
- Lazy computation: Each section computes its analysis only when the user navigates to it; nothing runs at upload time except the CSV loading
- Session state caching: All expensive computations (anomaly scan, context inference, roadmap generation, resource discovery) are stored in
st.session_stateand never recomputed unless explicitly triggered again via the action button - LLM graceful degradation: Every LLM-dependent feature has a heuristic fallback. If Ollama is offline, the app continues to function fully with heuristic-only results; the AI Assistant and Research Feedback sections display an actionable offline warning
- ydata-profiling optimised configuration: Spearman, Kendall, Phi-K, and Cramér's correlations are disabled; interaction plots and missing value heatmaps are disabled to reduce report generation time on large datasets
- Plot memory management:
plt.close()is called after every plot to prevent memory accumulation during bulk visualisation generation - Anomaly detection scaling: Quality checks are normalised by column count (
quality_score = max(0, 100 - raw_deductions / n_cols)) so scores remain comparable across tables of different widths
Here is a typical end-to-end session for a driver behaviour classification project:
- Upload — User uploads
driver_pulse_dataset.zipcontainingtrips.csv,accelerometer_data.csv, andearnings.csv - Overview — DataMind reports 3 tables, 250,000 total rows, 47 columns.
trips.csvhas 12% missing values inend_time - Schema — FK relationship detected:
trips.driver_id→earnings.driver_id(98% overlap). Relationship graph generated - Data Quality —
accelerometer_data.x_axishas 340 IQR outliers (high severity).trips.fareshows round-number bias (67% multiples of 5). Overall quality score: 74/100 - Context — User enters: "Predict driver churn from trip and sensor data". DataMind infers: Binary Classification · Transport/Mobility domain · High confidence. Target candidate:
churn_flag(binary, name match) - Research — XGBoost, SHAP, and Imbalanced-learn surfaces as top resources. User notes: "XGBoost handles missing values natively — useful for the end_time gaps". LLM feedback validates the insight and suggests SMOTE for class imbalance
- Roadmap — 6-phase plan generated. LLM guidance for Phase 1: "Focus KNN imputation for end_time given its 12% missingness rate and the strong FK relationship to the earnings table..."
- Insights — Strong correlation between
trip_distanceandfare(r=0.87),x_axisis right-skewed (skew=2.3). Feature suggestion:average_speed = distance / duration - Visualisations — 23 distribution plots generated for all numeric columns
- AI Assistant — User asks: "Should I use SMOTE or class weights for the imbalance?" — AI responds with dataset-specific advice referencing the actual column names and row counts
- Automated feature importance ranking before model selection using permutation importance or mutual information
- Time-series specific EDA module with autocorrelation plots, seasonality decomposition, and stationarity tests
- Custom anomaly threshold configuration per column type via the UI
- Export roadmap to PDF or Notion for team sharing and project tracking integration
- Expanded resource library with domain-specific Kaggle notebook links and GitHub repositories
- Multi-user session support with persistent project workspaces saved to disk or a lightweight database
- Streaming LLM responses in the AI Assistant using Ollama's streaming API for real-time token display
- Automated data cleaning suggestions with one-click application of recommended preprocessing steps
This project is licensed under the MIT License.
MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
Roopanshu Guota ML Engineer Learner Built DataMind as part of a data science research initiative focused on AI-augmented exploratory analysis and automated ML project planning. |
DataMind — From raw data to research-ready, locally and intelligently.
If you find this project useful, please consider giving it a star!