DataVerse ChatBot

DataVerse ChatBot is a Python-based application for real-time, AI-driven chat over data extracted from web pages and most common file formats. It combines web crawling, multi-format data extraction, and Retrieval-Augmented Generation (RAG) with a choice of Large Language Models (LLMs) to deliver context-aware responses. The system includes a chat history analysis agent that reports on user engagement patterns and response quality. It can be deployed as a WhatsApp or Telegram bot or embedded via an iframe, and it also supports voice messages.

Features

Data Extraction and Processing

Web Crawling: Supports two crawling libraries, crawl4ai and scrapegraphai, to gather content from specified web sources, with customizable parameters (e.g., crawl depth, preferred client).
Multi-Format Data Extraction: Supports two extraction libraries, langchain and docling, to extract data from most file formats (e.g., PDF, text, docx, csv, xlsx), broadening the knowledge base beyond web content.
Content Storage: Saves extracted data in data/web_content/ as text files in a clean Markdown format (which LLMs handle well) for indexing and retrieval.
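
As a rough illustration of the content storage step, the sketch below writes one page of extracted Markdown to a text file under a storage directory. It is not the project's own code: the function names `slugify` and `save_content` and the file-naming scheme are assumptions made for this example.

```python
import re
from pathlib import Path


def slugify(url: str) -> str:
    """Turn a URL into a filesystem-safe name (hypothetical naming scheme)."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "_", slug).strip("_")
    return slug or "page"


def save_content(url: str, markdown: str, root: str = "data/web_content") -> Path:
    """Write extracted Markdown to <root>/<slug>.txt and return the file path."""
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = out_dir / f"{slugify(url)}.txt"
    path.write_text(markdown, encoding="utf-8")
    return path
```

Calling `save_content(url, markdown)` once per crawled page would leave one text file per page, ready to be indexed later.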

Chat History Agent

An intelligent agent powered by LangChain that analyzes conversation data to extract insights such as common user questions, peak usage times, response quality, and user engagement patterns.

Monitoring and Uncertainty Detection

Response Monitoring: Implements a monitoring service to track responses and detect the questions that the LLM couldn't answer.
Uncertain Response Classification: Uses a trained classifier to detect uncertain responses from the LLM.
Email Notifications: Sends alerts via email when uncertain responses are detected.
Chat History Monitoring: Periodically emails the chat history to a configured email address.
The inference and monitoring services run in a separate thread, so the main program does not block while they work.
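
The threading arrangement described here can be sketched roughly as follows. This is a minimal illustration, not the project's monitor_service.py: the names `looks_uncertain`, `monitor_worker`, and `run_monitor`, the queue-based hand-off, and the keyword check standing in for the trained classifier are all assumptions.

```python
import queue
import threading

# Hypothetical stand-in for the trained classifier: a plain keyword check.
UNCERTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot answer")


def looks_uncertain(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in UNCERTAIN_MARKERS)


def monitor_worker(responses: queue.Queue, flagged: list) -> None:
    """Consume responses until a None sentinel arrives; record uncertain ones."""
    while True:
        item = responses.get()
        if item is None:
            break
        if looks_uncertain(item):
            flagged.append(item)  # a real service might send an email alert here


def run_monitor(batch: list) -> list:
    """Check a batch of responses on a background thread and return the flagged ones."""
    responses: queue.Queue = queue.Queue()
    flagged: list = []
    worker = threading.Thread(
        target=monitor_worker, args=(responses, flagged), daemon=True
    )
    worker.start()
    for response in batch:
        responses.put(response)  # enqueueing returns immediately
    responses.put(None)  # sentinel: tell the worker to stop
    worker.join()
    return flagged
```

Because the main thread only enqueues responses, the uncertainty check never delays the reply to the user.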

Dataset Creation and Model Training

Dataset Creation: Uses the make_dataset.py script to create standardized datasets from RAG responses. It supports:
- Data cleaning and tokenization using sentence-transformers/all-MiniLM-L6-v2.
- Saving structured responses in CSV format for further processing.
- Shuffling and saving datasets for training purposes.
Content Storage: Saves the dataset in data/datasets/.
Training Scripts: Utilizes train_clf.py to train classification models with support for:
- Random Forest and XGBoost classifiers.
- Hyperparameter tuning using RandomizedSearchCV.
- Embedding generation with sentence-transformers/all-MiniLM-L6-v2.
Evaluation Metrics:
- Accuracy, Precision, and Recall.
- ROC and Precision-Recall Curve plotting.
- The model achieved 92.7% accuracy on the test set.
Model Persistence: Saves trained models as .pkl files and logs metadata (e.g., hyperparameters, evaluation results, library versions).
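
To make the evaluation metrics concrete, the sketch below computes accuracy, precision, and recall for binary labels using only the standard library. The function name `binary_metrics` and the label convention (1 = uncertain response, 0 = confident response) are assumptions for illustration; train_clf.py may compute these values differently.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted/present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For an uncertainty detector, recall on the "uncertain" class indicates how many unanswered questions are actually caught, while precision indicates how many alerts are genuine.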

Chat Interfaces

WhatsApp Bot: Deployable as a WhatsApp bot using Twilio for seamless, mobile-friendly conversations.
Telegram Bot: Available as a Telegram bot, integrating with Telegram’s messaging ecosystem.
Iframe Embedding: Embeddable via an iframe for easy integration into websites or applications. The backend is built with FastAPI; the frontend uses HTML, CSS, and JavaScript.

Large Language Model (LLM) Integration

Multiple LLM Support: Integrates with various LLMs, including:
- OpenAI
- Claude
- Cohere
- DeepSeek
- Gemini
- Grok
- Mistral
Flexible LLM Selection: Configurable via settings to switch between LLMs based on user preference or use case.
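
One common way to make the LLM selectable by name is a small registry, sketched below. This is an illustrative pattern only: the registry, the `register` decorator, `get_backend`, and the stub functions are hypothetical and return placeholder strings instead of calling any provider API.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a provider name to a callable that answers a prompt.
_BACKENDS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that records a backend under a case-insensitive name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        _BACKENDS[name.lower()] = fn
        return fn
    return decorator


@register("openai")
def _openai_stub(prompt: str) -> str:
    return f"[openai] {prompt}"  # placeholder, no API call is made


@register("claude")
def _claude_stub(prompt: str) -> str:
    return f"[claude] {prompt}"  # placeholder, no API call is made


def get_backend(name: str) -> Callable[[str], str]:
    """Look up a backend by the name found in the settings."""
    try:
        return _BACKENDS[name.lower()]
    except KeyError:
        available = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"Unknown LLM '{name}'. Available: {available}") from None
```

With this layout, switching models is a one-line settings change, and adding a provider means registering one more function.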

Retrieval-Augmented Generation (RAG)

Base RAG Framework: Provides a consistent RAG interface for retrieval and generation.
LLM-Specific RAG: Custom implementations for each supported LLM, optimizing performance.
Vector Store Integration: Uses FAISS (in data/indexes/) for fast, efficient document retrieval.
Context-Aware Responses: Combines extracted data with LLM capabilities for accurate replies.
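
The retrieve-then-generate flow can be illustrated with a toy example. The sketch below deliberately swaps in simpler techniques: word-count vectors instead of a real embedding model, and a brute-force cosine-similarity scan instead of a FAISS index. All function names are assumptions, and the final prompt format is only one possible layout.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents most similar to the query (brute force, not FAISS)."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list, k: int = 2) -> str:
    """Combine the retrieved context with the question before calling an LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The string returned by `build_prompt` is what would be sent to the selected LLM, which is how the extracted data ends up shaping the reply.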

Embedding Generation

Base Embedding System: Generates embeddings for data and queries via a reusable interface.
Multiple Embedding Models: Supports embedding APIs from LLMs (Cohere, Mistral, OpenAI) or standalone models (HuggingFace).
Content Indexing: Stores embeddings in FAISS indexes (index.faiss, index.pkl) for quick retrieval.

Chat Functionality

Chat History Persistence: Saves conversations in a SQLite database (chat_history.db).
Context Retention: Maintains conversational context using history and retrieved data.
Query Processing: Processes user inputs through embeddings and RAG for response generation.
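
Chat history persistence with SQLite can be sketched with the standard library alone. The table layout, column names, and function names below are assumptions for illustration and are not taken from the project's chat_history.db schema.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a history database; ':memory:' keeps it in RAM for demos."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL
           )"""
    )
    return conn


def save_message(conn: sqlite3.Connection, user_id: str, role: str, content: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)",
            (user_id, role, content),
        )


def recent_messages(conn: sqlite3.Connection, user_id: str, limit: int = 10) -> list:
    """Return the last `limit` messages for a user, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE user_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return rows[::-1]
```

The rows returned by `recent_messages` could be prepended to the next prompt, which is one simple way to retain conversational context.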

Modular Design

Package Structure: Organized into reusable modules:
- crawler.py: Web crawling logic.
- embeddings/: Embedding generation.
- rag/: RAG implementations.
- utils/: Helper functions.
Extensibility: Easy to add new LLMs, embedding models, or features.

Admin Dashboard

The admin dashboard provides a centralized interface for managing the RAG system, offering tools to monitor usage, manage content, and update account settings.

Admin Login: Secure access to the admin panel with a username and password.
System Overview Dashboard: Displays key metrics such as total users, active conversations, token usage, and costs over the last 24 hours. It also includes a table of recent conversations and a pie chart showing model usage distribution.
Content Management: Allows the admin to upload files or crawl websites to expand the RAG system's knowledge base.
Account Settings: Admins can update their username and password.

Setup and Installation

Automated Setup: Single-command installation via install.bat (Windows) for dependencies and configuration.
Dependency Management: Installs packages from pyproject.toml (built by uv).
Environment Configuration: The installation script also configures the environment automatically.

Data Management

Persistent Storage: Organizes data into:
- data/web_content/: Extracted content.
- data/indexes/: Vectorized indexes.
- data/database/: Chat history database (also tracks the costs incurred by LLM usage across all chat interfaces).
Efficient Retrieval: Uses FAISS for scalable similarity searches.
Database Support: Lightweight SQLite for chat history.

Installation

Windows

Clone the repository:

git clone https://github.com/AliElneklawy/DataVerse-ChatBot.git
cd DataVerse-ChatBot

Run the installation script:
```
install.bat
```

This installs dependencies and configures the environment.

Usage

Configure your .env file with API keys.
Run the application (you can run the Telegram bot, the WhatsApp bot, the iframe web app, or just the main.py file):
```
python main.py
```
You can create your own dataset using the make_dataset.py script.
You can train your own classifier using the train_clf.py script.

Project Structure

├── DataVerse-Chatbot
│   ├── __init__.py 
│   ├── data 
│   │   ├── chat_history 
│   │   │   ├── ....
│   │   ├── database 
│   │   │   ├── ....
│   │   ├── datasets 
│   │   │   ├── ....
│   │   ├── indexes 
│   │   │   ├── ....
│   │   ├── logs 
│   │   │   ├── ....
│   │   ├── models 
│   │   │   ├── clf.pkl 
│   │   │   ├── metadata.json 
│   │   ├── training_files 
│   │   │   ├── ....
│   │   ├── web_content 
│   │   │   ├── ....
│   ├── src 
│   │   ├── admin_dashboard_launcher.py 
│   │   ├── main.py 
│   │   ├── tg_bot.py 
│   │   ├── whatsapp_bot.py 
│   │   ├── __init__.py 
│   │   ├── chatbot 
│   │   │   ├── config.py 
│   │   │   ├── crawler.py 
│   │   │   ├── voice_mode.py 
│   │   │   ├── __init__.py 
│   │   │   ├── embeddings 
│   │   │   │   ├── base_embedding.py 
│   │   │   │   ├── __init__.py 
│   │   │   ├── rag 
│   │   │   │   ├── base_rag.py 
│   │   │   │   ├── claude_rag.py 
│   │   │   │   ├── cohere_rag.py 
│   │   │   │   ├── deepseek_rag.py 
│   │   │   │   ├── gemini_rag.py 
│   │   │   │   ├── grok_rag.py 
│   │   │   │   ├── mistral_rag.py 
│   │   │   │   ├── openai_rag.py 
│   │   │   │   ├── __init__.py 
│   │   │   ├── utils 
│   │   │   │   ├── admin_utils.py 
│   │   │   │   ├── crawler_progress.py 
│   │   │   │   ├── file_loader.py 
│   │   │   │   ├── inference.py 
│   │   │   │   ├── make_dataset.py 
│   │   │   │   ├── monitor_service.py 
│   │   │   │   ├── paths.py 
│   │   │   │   ├── train_clf.py 
│   │   │   │   ├── utils.py 
│   │   │   │   ├── __init__.py 
│   │   ├── web 
│   │   │   ├── admin_dashboard.py 
│   │   │   ├── chat_web_app.py 
│   │   │   ├── chat_web_template.py 
│   │   │   ├── how to run.txt 
│   │   │   ├── __init__.py 
│   │   │   ├── static 
│   │   │   │   ├── __init__.py 
│   │   │   │   ├── css 
│   │   │   │   │   ├── admin.css 
│   │   │   │   │   ├── dark_mode.css 
│   │   │   │   │   ├── __init__.py 
│   │   │   │   ├── js 
│   │   │   │   │   ├── admin.js 
│   │   │   │   │   ├── __init__.py 
│   │   │   ├── templates 
│   │   │   │   ├── __init__.py 
│   │   │   │   ├── admin 
│   │   │   │   │   ├── account.html 
│   │   │   │   │   ├── base.html 
│   │   │   │   │   ├── content.html 
│   │   │   │   │   ├── dashboard.html 
│   │   │   │   │   ├── history.html 
│   │   │   │   │   ├── login.html 
│   │   │   │   │   ├── models.html 
│   │   │   │   │   ├── system.html 
│   │   │   │   │   ├── users.html 
│   │   │   │   │   ├── user_detail.html 
│   │   │   │   │   ├── view_content.html 
│   │   │   │   │   ├── __init__.py 
│   ├── tests 
│   │   ├── locustfile.py
