LLM CellMinerCDB PandasAI

AI-powered platform for analyzing ~280M cancer cell line data points through natural language queries. Built with PandasAI 3.0, Azure OpenAI, and R-based CellMiner data processing.

Overview

PostgreSQL-powered infrastructure for real-time analysis of cancer genomics datasets. Combines CellMinerCDB data with AI-powered natural language processing to query complex multi-omics data without SQL.

Key Features

Data Import: R-based extraction and PostgreSQL bulk loading of 280M+ rows
Natural Language Queries: GPT-4 powered analysis via PandasAI 3.0
High-Performance Database: Optimized PostgreSQL with intelligent indexing
Cell Line Harmonization: FlashText-powered synonym matching across databases
Export: Publication-ready SVG charts with downloadable code

System Architecture

Core Files

llm_cellminercdb_pandasai/
├── pandasai_helper.py              # PostgreSQL/PandasAI integration
├── streamlit_pandasai_chatbot.py   # Main Streamlit application
├── local.r                         # R data extraction & upload
├── uploader_optimized.r            # R high-performance transfer
├── pyproject.toml                  # Python dependencies
├── env.example                     # Environment template
├── rcellminerUtilsCDB/             # R utilities and data files
├── output_csvs/                    # CSV exports
├── sql_outputs/                    # Generated SQL queries
└── logs/                           # System logs

Data Model

Table: cellminer_combined_data (~280M rows)

Column	Type	Description	Example Values
`data_type`	TEXT	Measurement type	`'exp'`, `'mut'`, `'cop'`, `'act'`
`dataset`	TEXT	Source database	`'CCLE'`, `'GDSC'`
`row_name`	TEXT	Gene/drug identifier	`'TP53'`, `'Cisplatin'`
`col_name`	TEXT	Cell line name	`'HeLa'`, `'MCF7'`
`value`	DOUBLE PRECISION	Measurement value	`1.23`, `-2.45`
`cell_line_tissue`	TEXT	Cancer tissue type	`'breast'`, `'lung'`, `'colon'`
`drug_moa`	TEXT	Drug mechanism	`'DNA synthesis inhibitor'`
`drug_synonyms`	TEXT	Alternative drug names	`'cisplatin
`cell_line_synonyms`	TEXT	Cell line identifiers	`'CVCL_0033

Data Types (examples):

exp: Gene expression (log2-transformed)
mut: Mutation status (binary/VAF)
cop: Copy number variation (log2 ratio)
act: Drug activity (IC50 values in micromolar)

Dataset Download

Download required R data packages from: https://zenodo.org/records/15122311

Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo.

R Utilities and Data Files

The rcellminerUtilsCDB/ directory contains (https://github.com/CBIIT/rcellminerUtilsCDB):

cellLineMatchTab.RData - Cell line matching data
drugSynonymTab.RData - Drug synonym mapping data
geneToChromBand.RData - Gene to chromosome band mapping
HugoGeneSynonyms.RData - HUGO gene symbol synonyms
Aug8_nsc_cid_smiles.csv - Chemical structure data (custom file, not included in directory)

Additional Configuration Files

pyproject.toml - Python project configuration and dependencies
requirements.txt - Python package requirements
uv.lock - UV package manager lock file
LICENSE - Project license file
prompts.txt - PandasAI prompt templates

Output Directories

output_csvs/ - Contains exported CSV files for cell line and drug synonyms
sql_outputs/ - Stores generated SQL query files with timestamps
logs/ - System logs directory (auto-created)

Dataset Directories

The project includes several R package datasets that contain cancer genomics data:

ccleData/
gdscData/
ctrpData/
mdaMillsData/
nciSarcomaData/
uniSarcomaData/

📥 Dataset Download: Download the required R data packages from the official CellMinerCDB Zenodo repository: https://zenodo.org/records/15122311

Dataset Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo. https://doi.org/10.5281/zenodo.15122311

Installation & Setup

Prerequisites

Python 3.10+
R 4.0+ (for data extraction)
Azure OpenAI account with GPT-4 deployment
8GB+ RAM (16GB recommended for large imports)
80GB+ disk space (for PostgreSQL data + indexes)

PostgreSQL Setup:

# macOS
brew install postgresql@14
brew services start postgresql@14

# Ubuntu/Debian
sudo apt-get install postgresql postgresql-contrib

R Setup:

R -e "install.packages(c('Biobase', 'data.table', 'RPostgres', 'DBI'), dependencies=TRUE)"

Installation

git clone https://github.com/sciluna/llm_cellminercdb_pandasai.git
cd llm_cellminercdb_pandasai

# Install dependencies
uv venv .venv
source .venv/bin/activate
uv pip install -e .

Environment Configuration

Create .env file (copy from env.example):

AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview

DBEAVER_URL=postgresql://host:port
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password

PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1

Usage

Step 1: Dataset Setup

Download CellMinerCDB Datasets: Before running any data processing scripts, download the required cancer pharmacogenomics datasets from the official Zenodo repository:

https://zenodo.org/records/15122311

Important: This project uses a zenodo-only approach - download the R packages and extract the RData files directly. No external R packages like rcellminer are required. The data extraction scripts access the data structures directly using Biobase functions.

Download and extract the R packages as needed for your analysis. The datasets include:

ccleData
gdscDataDec15
ctrpData
nciSarcomaData
uniSarcomaData
mdaMillsData

Data Processing Overview

The platform uses R-based data processing to extract and upload CellMiner data:

R-based Processing Pipeline (local.r + uploader_optimized.r)

Input: CellMiner RData files from R packages (gdscData, ccleData, ctrpData, etc.)
Process: Extracts data directly from R objects, performs canonical ID mapping
Output: PostgreSQL tables with enriched metadata and optimized indexes
Benefits: Direct access to source data structures, comprehensive metadata integration

Step 2: Database Preparation

Verify PostgreSQL:

# Check PostgreSQL status
brew services list | grep postgresql
psql -h localhost -U $USER -d postgres -c "SELECT version();"

Import Data (R-based Setup):

# Extract and upload CellMiner data to local PostgreSQL
Rscript local.r                    # Extract and upload data
Rscript uploader_optimized.r       # Transfer to remote DB (optional)

3. Launch Application

Option A: Using uv (Recommended for Development)

uv run streamlit run streamlit_pandasai_chatbot.py

Option B: Using Docker (Recommended for Production)

# Copy environment file and configure
cp env.example .env
# Edit .env with your credentials

# Build and run with Docker Compose
docker-compose up --build

# Or run with just the app (using external database)
docker-compose up streamlit-app

# Or run with local PostgreSQL database
docker-compose --profile with-db up --build

Option C: Using Docker directly

# Build the image
docker build -t cellminer-app .

# Run the container
docker run -p 8501:8501 \
  --env-file .env \
  -v $(pwd)/logs:/app/logs \
  -v $(pwd)/exports:/app/exports \
  cellminer-app

Navigate to: http://localhost:8501

Docker Deployment

Features

Multi-stage build: Optimized for production with minimal image size
Security: Runs as non-root user
Persistence: Volume mounts for logs, exports, and outputs
Health checks: Built-in health monitoring
Environment variables: Configurable via .env file

Environment Variables

Required environment variables (set in .env file):

# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview

# Database Configuration
DBEAVER_URL=postgresql://host:port/database
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password

# PandasAI Security Configuration
PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1

Docker Commands

# Development with auto-rebuild
docker-compose up --build

# Production deployment
docker-compose up -d

# View logs
docker-compose logs -f streamlit-app

# Stop services
docker-compose down

Scripts Documentation

Core Scripts

`streamlit_pandasai_chatbot.py`

Main Streamlit web application providing AI-powered cancer data analysis interface.

R Package Dependencies:

# Install missing Bioconductor packages
R -e "if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
R -e "BiocManager::install(c('Biobase'))"

# Install CRAN packages
R -e "install.packages(c('RPostgres', 'DBI', 'data.table'), dependencies=TRUE)"

📄 Scripts Documentation

This section documents all scripts and utilities currently tracked in the repository and available for use.

📋 Complete Script Inventory

Core Application Scripts:

streamlit_pandasai_chatbot.py - Main AI-powered web application for cancer data analysis

Data Processing Utilities:

pandasai_helper.py - Database integration and data processing module
local.r - R script for extracting and uploading CellMiner data to local PostgreSQL
uploader_optimized.r - R script for high-performance data transfer from local to remote PostgreSQL

Configuration Files:

pyproject.toml - Python project configuration and dependencies
requirements.txt - Python package requirements (pip format)
env.example - Environment variables template
LICENSE - Project licensing terms
.streamlit/config.toml - Streamlit application configuration

Core Application Scripts

`streamlit_pandasai_chatbot.py` - Main Web Application

Purpose: Primary Streamlit web interface for AI-powered cancer data analysis
What it does:
- Provides natural language query interface using PandasAI 3.0
- Connects to PostgreSQL database with ~280M cancer genomics data points
- Handles Azure OpenAI integration for query processing
- Generates publication-ready SVG charts with downloadable code
- Manages cell line synonym resolution and data harmonization
- Includes comprehensive error handling and connection validation

How to run:

source .venv/bin/activate
streamlit run streamlit_pandasai_chatbot.py

Dependencies: Requires .env file with Azure OpenAI and PostgreSQL credentials
Access: Navigate to http://localhost:8501 after launch

Data Processing Utilities

`pandasai_helper.py` - Core Database Integration Module

Purpose: Central module providing database connectivity and data processing utilities
What it does:
- Manages PostgreSQL connections using environment variables [[memory:4180646]]
- Creates and configures PandasAI datasets for AI analysis
- Provides database connectivity and query optimization utilities
- Provides cell line synonym resolution using FlashText algorithm
- Implements data validation and integrity checking
- Manages connection pooling and transaction handling
- Provides logging with rotating file handlers
Key Functions:
- create_postgres_dataset() - Dataset creation and management
- build_cell_line_keyword_processor() - Synonym resolution setup
- configure_postgres_for_bulk_import() - Database optimization
- sanitize_column_names() - Data preparation utilities
Usage: Imported by other scripts and the main application
Environment: Uses DBEAVER_* credentials exclusively [[memory:4180646]]

`local.r` - Local Data Extraction and Database Upload

Purpose: R script for extracting CellMiner data from RData files and uploading to local PostgreSQL database
What it does:
- Extracts molecular data (exp, mut, cop, mda, xsq, mir, pro, mtb, rrb, met, muf, mth, smt, var, fus, bmt) from multiple datasets
- Extracts drug activity data (act) using getAct() function
- Processes data from GDSC, CCLE, NCISARCOMA, UNISARCOMA, MDAMILLS, and CTRP datasets
- Creates canonical ID mappings for drugs (PubChem CID) and cell lines (Cellosaurus Accession)
- Enriches data with tissue information from OncoTree1 and mechanism of action (MOA) metadata
- Performs bulk upload to PostgreSQL with optimized chunking and transaction handling
- Creates comprehensive database indexes for fast querying

Configuration:

# User configuration section
base_path <- "/path/to/your/project"
skip_datasets <- c()  # For fresh run, or c("GDSC", "CCLE") to resume

# Database configuration
db_host <- "localhost"
db_port <- 5432
db_name <- "cellminerdb_new"
db_user <- "your_username"
db_pass <- "your_password"

How to run:

source .venv/bin/activate
Rscript local.r

Output: Creates cellminer_data_new table with ~280M rows including canonical identifiers and metadata
Data Safety: Enforces that the cellminer_data_env table is never edited or deleted [[memory:6245503]]

`uploader_optimized.r` - High-Performance Database Transfer

Purpose: Optimized R script for transferring data from local PostgreSQL to remote Azure PostgreSQL
What it does:
- Transfers complete cellminer_data_new table from local to remote PostgreSQL database
- Uses advanced PostgreSQL COPY command for maximum transfer speed
- Implements key-based pagination instead of OFFSET for efficient reading
- Provides comprehensive data integrity verification between source and destination
- Supports resume capability with checkpoint tracking for interrupted transfers
- Creates optimized indexes on remote database for query performance
Safety Features:
- Source Protection: Multiple safety checks prevent accidental modification of local database
- Host Validation: Ensures remote database is not localhost to prevent data loss
- Connection Testing: Validates both local and remote connections before transfer
- Integrity Verification: 6-step verification process comparing row counts, distributions, and statistics

Configuration:

# Performance settings
use_copy_method <- TRUE        # Use COPY instead of INSERT
use_parallel <- FALSE          # Enable parallel processing
larger_chunks <- TRUE          # Use larger chunk sizes
fresh_run <- TRUE             # Set to FALSE for resume capability

# Local database (source)
local_db_host <- "localhost"
local_db_name <- "cellminerdb_new"
local_db_user <- "your_username"

# Remote database (destination)
remote_db_host <- "your-azure-host.postgres.database.azure.com"
remote_db_name <- "postgres"
remote_db_user <- "azure_username"
remote_db_pass <- "azure_password"

How to run:

source .venv/bin/activate
Rscript uploader_optimized.r

Configuration Files

`pyproject.toml` - Python Project Configuration

Purpose: Python project metadata and dependency management
What it contains:
- Project name, version, and description
- Python version requirement (3.11.*)
- Complete dependency list with pinned versions
- Build system configuration for packaging
Usage: Used by uv pip install -e . and uv for dependency management

`requirements.txt` - Python Dependencies

Purpose: Pip-compatible dependency specification
What it contains: Pinned versions of all Python packages required
Usage: uv pip install -r requirements.txt (alternative to pyproject.toml)

`env.example` - Environment Variables Template

Purpose: Template for required environment variables
What it contains:
- Azure OpenAI configuration (endpoint, API key, deployment name, API version)
- PostgreSQL connection details (URL, database, username, password) [[memory:4180646]]
- PandasAI security settings (bypass security, max retries)

How to use:

cp env.example .env
# Edit .env with your actual credentials

`LICENSE` - Project License

Purpose: Defines the legal terms under which the project can be used, modified, and distributed
Usage: Reference for understanding project licensing terms and obligations

`.streamlit/config.toml` - Streamlit Configuration

Purpose: Configures Streamlit application settings and UI preferences
What it contains: Application theme, server settings, and display options
Usage: Automatically loaded by Streamlit on application startup

Data Resources

`rcellminerUtilsCDB/` Directory

R utility data files for CellMiner integration:

Aug8_nsc_cid_smiles.csv - Chemical structure identifiers and SMILES notation for drug compounds
cellLineMatchTab.RData - Cell line harmonization data for cross-database matching
drugSynonymTab.RData - Drug synonym mappings for compound name standardization
geneToChromBand.RData - Gene to chromosome band mapping annotations
HugoGeneSynonyms.RData - HUGO gene symbol synonyms for gene name standardization

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.streamlit		.streamlit
rcellminerUtilsCDB		rcellminerUtilsCDB
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
env.example		env.example
local.r		local.r
pandasai_helper.py		pandasai_helper.py
pyproject.toml		pyproject.toml
streamlit_pandasai_chatbot.py		streamlit_pandasai_chatbot.py
uploader_optimized.r		uploader_optimized.r

License

MilitPatel/cellminercdb_pandasai

Folders and files

Latest commit

History

Repository files navigation

LLM CellMinerCDB PandasAI

Overview

Key Features

System Architecture

Core Files

Data Model

Dataset Download

R Utilities and Data Files

Additional Configuration Files

Output Directories

Dataset Directories

Installation & Setup

Prerequisites

Installation

Environment Configuration

Usage

Step 1: Dataset Setup

Data Processing Overview

R-based Processing Pipeline (local.r + uploader_optimized.r)

Step 2: Database Preparation

3. Launch Application

Option A: Using uv (Recommended for Development)

Option B: Using Docker (Recommended for Production)

Option C: Using Docker directly

Docker Deployment

Features

Environment Variables

Docker Commands

Scripts Documentation

Core Scripts

streamlit_pandasai_chatbot.py

📄 Scripts Documentation

📋 Complete Script Inventory

Core Application Scripts

streamlit_pandasai_chatbot.py - Main Web Application

Data Processing Utilities

pandasai_helper.py - Core Database Integration Module

local.r - Local Data Extraction and Database Upload

uploader_optimized.r - High-Performance Database Transfer

Configuration Files

pyproject.toml - Python Project Configuration

requirements.txt - Python Dependencies

env.example - Environment Variables Template

LICENSE - Project License

.streamlit/config.toml - Streamlit Configuration

Data Resources

rcellminerUtilsCDB/ Directory

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`streamlit_pandasai_chatbot.py`

`streamlit_pandasai_chatbot.py` - Main Web Application

`pandasai_helper.py` - Core Database Integration Module

`local.r` - Local Data Extraction and Database Upload

`uploader_optimized.r` - High-Performance Database Transfer

`pyproject.toml` - Python Project Configuration

`requirements.txt` - Python Dependencies

`env.example` - Environment Variables Template

`LICENSE` - Project License

`.streamlit/config.toml` - Streamlit Configuration

`rcellminerUtilsCDB/` Directory

Packages