AI-powered platform for analyzing ~280M cancer cell line data points through natural language queries. Built with PandasAI 3.0, Azure OpenAI, and R-based CellMiner data processing.
PostgreSQL-powered infrastructure for real-time analysis of cancer genomics datasets. Combines CellMinerCDB data with AI-powered natural language processing to query complex multi-omics data without SQL.
- Data Import: R-based extraction and PostgreSQL bulk loading of 280M+ rows
- Natural Language Queries: GPT-4 powered analysis via PandasAI 3.0
- High-Performance Database: Optimized PostgreSQL with intelligent indexing
- Cell Line Harmonization: FlashText-powered synonym matching across databases
- Export: Publication-ready SVG charts with downloadable code
llm_cellminercdb_pandasai/
βββ pandasai_helper.py # PostgreSQL/PandasAI integration
βββ streamlit_pandasai_chatbot.py # Main Streamlit application
βββ local.r # R data extraction & upload
βββ uploader_optimized.r # R high-performance transfer
βββ pyproject.toml # Python dependencies
βββ env.example # Environment template
βββ rcellminerUtilsCDB/ # R utilities and data files
βββ output_csvs/ # CSV exports
βββ sql_outputs/ # Generated SQL queries
βββ logs/ # System logs
Table: cellminer_combined_data (~280M rows)
| Column | Type | Description | Example Values |
|---|---|---|---|
data_type |
TEXT | Measurement type | 'exp', 'mut', 'cop', 'act' |
dataset |
TEXT | Source database | 'CCLE', 'GDSC' |
row_name |
TEXT | Gene/drug identifier | 'TP53', 'Cisplatin' |
col_name |
TEXT | Cell line name | 'HeLa', 'MCF7' |
value |
DOUBLE PRECISION | Measurement value | 1.23, -2.45 |
cell_line_tissue |
TEXT | Cancer tissue type | 'breast', 'lung', 'colon' |
drug_moa |
TEXT | Drug mechanism | 'DNA synthesis inhibitor' |
drug_synonyms |
TEXT | Alternative drug names | `'cisplatin |
cell_line_synonyms |
TEXT | Cell line identifiers | `'CVCL_0033 |
Data Types (examples):
exp: Gene expression (log2-transformed)mut: Mutation status (binary/VAF)cop: Copy number variation (log2 ratio)act: Drug activity (IC50 values in micromolar)
Download required R data packages from: https://zenodo.org/records/15122311
Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo.
The rcellminerUtilsCDB/ directory contains (https://github.com/CBIIT/rcellminerUtilsCDB):
cellLineMatchTab.RData- Cell line matching datadrugSynonymTab.RData- Drug synonym mapping datageneToChromBand.RData- Gene to chromosome band mappingHugoGeneSynonyms.RData- HUGO gene symbol synonymsAug8_nsc_cid_smiles.csv- Chemical structure data (custom file, not included in directory)
pyproject.toml- Python project configuration and dependenciesrequirements.txt- Python package requirementsuv.lock- UV package manager lock fileLICENSE- Project license fileprompts.txt- PandasAI prompt templates
output_csvs/- Contains exported CSV files for cell line and drug synonymssql_outputs/- Stores generated SQL query files with timestampslogs/- System logs directory (auto-created)
The project includes several R package datasets that contain cancer genomics data:
ccleData/gdscData/ctrpData/mdaMillsData/nciSarcomaData/uniSarcomaData/
π₯ Dataset Download: Download the required R data packages from the official CellMinerCDB Zenodo repository: https://zenodo.org/records/15122311
Dataset Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo. https://doi.org/10.5281/zenodo.15122311
- Python 3.10+
- R 4.0+ (for data extraction)
- Azure OpenAI account with GPT-4 deployment
- 8GB+ RAM (16GB recommended for large imports)
- 80GB+ disk space (for PostgreSQL data + indexes)
PostgreSQL Setup:
# macOS
brew install postgresql@14
brew services start postgresql@14
# Ubuntu/Debian
sudo apt-get install postgresql postgresql-contribR Setup:
R -e "install.packages(c('Biobase', 'data.table', 'RPostgres', 'DBI'), dependencies=TRUE)"git clone https://github.com/sciluna/llm_cellminercdb_pandasai.git
cd llm_cellminercdb_pandasai
# Install dependencies
uv venv .venv
source .venv/bin/activate
uv pip install -e .Create .env file (copy from env.example):
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview
DBEAVER_URL=postgresql://host:port
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password
PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1Download CellMinerCDB Datasets: Before running any data processing scripts, download the required cancer pharmacogenomics datasets from the official Zenodo repository:
https://zenodo.org/records/15122311
Important: This project uses a zenodo-only approach - download the R packages and extract the RData files directly. No external R packages like rcellminer are required. The data extraction scripts access the data structures directly using Biobase functions.
Download and extract the R packages as needed for your analysis. The datasets include:
- ccleData
- gdscDataDec15
- ctrpData
- nciSarcomaData
- uniSarcomaData
- mdaMillsData
The platform uses R-based data processing to extract and upload CellMiner data:
- Input: CellMiner RData files from R packages (gdscData, ccleData, ctrpData, etc.)
- Process: Extracts data directly from R objects, performs canonical ID mapping
- Output: PostgreSQL tables with enriched metadata and optimized indexes
- Benefits: Direct access to source data structures, comprehensive metadata integration
Verify PostgreSQL:
# Check PostgreSQL status
brew services list | grep postgresql
psql -h localhost -U $USER -d postgres -c "SELECT version();"Import Data (R-based Setup):
# Extract and upload CellMiner data to local PostgreSQL
Rscript local.r # Extract and upload data
Rscript uploader_optimized.r # Transfer to remote DB (optional)uv run streamlit run streamlit_pandasai_chatbot.py# Copy environment file and configure
cp env.example .env
# Edit .env with your credentials
# Build and run with Docker Compose
docker-compose up --build
# Or run with just the app (using external database)
docker-compose up streamlit-app
# Or run with local PostgreSQL database
docker-compose --profile with-db up --build# Build the image
docker build -t cellminer-app .
# Run the container
docker run -p 8501:8501 \
--env-file .env \
-v $(pwd)/logs:/app/logs \
-v $(pwd)/exports:/app/exports \
cellminer-appNavigate to: http://localhost:8501
- Multi-stage build: Optimized for production with minimal image size
- Security: Runs as non-root user
- Persistence: Volume mounts for logs, exports, and outputs
- Health checks: Built-in health monitoring
- Environment variables: Configurable via
.envfile
Required environment variables (set in .env file):
# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview
# Database Configuration
DBEAVER_URL=postgresql://host:port/database
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password
# PandasAI Security Configuration
PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1# Development with auto-rebuild
docker-compose up --build
# Production deployment
docker-compose up -d
# View logs
docker-compose logs -f streamlit-app
# Stop services
docker-compose downMain Streamlit web application providing AI-powered cancer data analysis interface.
R Package Dependencies:
# Install missing Bioconductor packages
R -e "if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
R -e "BiocManager::install(c('Biobase'))"
# Install CRAN packages
R -e "install.packages(c('RPostgres', 'DBI', 'data.table'), dependencies=TRUE)"This section documents all scripts and utilities currently tracked in the repository and available for use.
Core Application Scripts:
streamlit_pandasai_chatbot.py- Main AI-powered web application for cancer data analysis
Data Processing Utilities:
pandasai_helper.py- Database integration and data processing modulelocal.r- R script for extracting and uploading CellMiner data to local PostgreSQLuploader_optimized.r- R script for high-performance data transfer from local to remote PostgreSQL
Configuration Files:
pyproject.toml- Python project configuration and dependenciesrequirements.txt- Python package requirements (pip format)env.example- Environment variables templateLICENSE- Project licensing terms.streamlit/config.toml- Streamlit application configuration
- Purpose: Primary Streamlit web interface for AI-powered cancer data analysis
- What it does:
- Provides natural language query interface using PandasAI 3.0
- Connects to PostgreSQL database with ~280M cancer genomics data points
- Handles Azure OpenAI integration for query processing
- Generates publication-ready SVG charts with downloadable code
- Manages cell line synonym resolution and data harmonization
- Includes comprehensive error handling and connection validation
- How to run:
source .venv/bin/activate streamlit run streamlit_pandasai_chatbot.py - Dependencies: Requires
.envfile with Azure OpenAI and PostgreSQL credentials - Access: Navigate to
http://localhost:8501after launch
- Purpose: Central module providing database connectivity and data processing utilities
- What it does:
- Manages PostgreSQL connections using environment variables [[memory:4180646]]
- Creates and configures PandasAI datasets for AI analysis
- Provides database connectivity and query optimization utilities
- Provides cell line synonym resolution using FlashText algorithm
- Implements data validation and integrity checking
- Manages connection pooling and transaction handling
- Provides logging with rotating file handlers
- Key Functions:
create_postgres_dataset()- Dataset creation and managementbuild_cell_line_keyword_processor()- Synonym resolution setupconfigure_postgres_for_bulk_import()- Database optimizationsanitize_column_names()- Data preparation utilities
- Usage: Imported by other scripts and the main application
- Environment: Uses
DBEAVER_*credentials exclusively [[memory:4180646]]
- Purpose: R script for extracting CellMiner data from RData files and uploading to local PostgreSQL database
- What it does:
- Extracts molecular data (
exp,mut,cop,mda,xsq,mir,pro,mtb,rrb,met,muf,mth,smt,var,fus,bmt) from multiple datasets - Extracts drug activity data (
act) usinggetAct()function - Processes data from GDSC, CCLE, NCISARCOMA, UNISARCOMA, MDAMILLS, and CTRP datasets
- Creates canonical ID mappings for drugs (PubChem CID) and cell lines (Cellosaurus Accession)
- Enriches data with tissue information from OncoTree1 and mechanism of action (MOA) metadata
- Performs bulk upload to PostgreSQL with optimized chunking and transaction handling
- Creates comprehensive database indexes for fast querying
- Extracts molecular data (
- Configuration:
# User configuration section base_path <- "/path/to/your/project" skip_datasets <- c() # For fresh run, or c("GDSC", "CCLE") to resume # Database configuration db_host <- "localhost" db_port <- 5432 db_name <- "cellminerdb_new" db_user <- "your_username" db_pass <- "your_password"
- How to run:
source .venv/bin/activate Rscript local.r - Output: Creates
cellminer_data_newtable with ~280M rows including canonical identifiers and metadata - Data Safety: Enforces that the
cellminer_data_envtable is never edited or deleted [[memory:6245503]]
- Purpose: Optimized R script for transferring data from local PostgreSQL to remote Azure PostgreSQL
- What it does:
- Transfers complete
cellminer_data_newtable from local to remote PostgreSQL database - Uses advanced PostgreSQL COPY command for maximum transfer speed
- Implements key-based pagination instead of OFFSET for efficient reading
- Provides comprehensive data integrity verification between source and destination
- Supports resume capability with checkpoint tracking for interrupted transfers
- Creates optimized indexes on remote database for query performance
- Transfers complete
- Safety Features:
- Source Protection: Multiple safety checks prevent accidental modification of local database
- Host Validation: Ensures remote database is not localhost to prevent data loss
- Connection Testing: Validates both local and remote connections before transfer
- Integrity Verification: 6-step verification process comparing row counts, distributions, and statistics
- Configuration:
# Performance settings use_copy_method <- TRUE # Use COPY instead of INSERT use_parallel <- FALSE # Enable parallel processing larger_chunks <- TRUE # Use larger chunk sizes fresh_run <- TRUE # Set to FALSE for resume capability # Local database (source) local_db_host <- "localhost" local_db_name <- "cellminerdb_new" local_db_user <- "your_username" # Remote database (destination) remote_db_host <- "your-azure-host.postgres.database.azure.com" remote_db_name <- "postgres" remote_db_user <- "azure_username" remote_db_pass <- "azure_password"
- How to run:
source .venv/bin/activate Rscript uploader_optimized.r
- Purpose: Python project metadata and dependency management
- What it contains:
- Project name, version, and description
- Python version requirement (3.11.*)
- Complete dependency list with pinned versions
- Build system configuration for packaging
- Usage: Used by
uv pip install -e .anduvfor dependency management
- Purpose: Pip-compatible dependency specification
- What it contains: Pinned versions of all Python packages required
- Usage:
uv pip install -r requirements.txt(alternative to pyproject.toml)
- Purpose: Template for required environment variables
- What it contains:
- Azure OpenAI configuration (endpoint, API key, deployment name, API version)
- PostgreSQL connection details (URL, database, username, password) [[memory:4180646]]
- PandasAI security settings (bypass security, max retries)
- How to use:
cp env.example .env # Edit .env with your actual credentials
- Purpose: Defines the legal terms under which the project can be used, modified, and distributed
- Usage: Reference for understanding project licensing terms and obligations
- Purpose: Configures Streamlit application settings and UI preferences
- What it contains: Application theme, server settings, and display options
- Usage: Automatically loaded by Streamlit on application startup
R utility data files for CellMiner integration:
Aug8_nsc_cid_smiles.csv- Chemical structure identifiers and SMILES notation for drug compoundscellLineMatchTab.RData- Cell line harmonization data for cross-database matchingdrugSynonymTab.RData- Drug synonym mappings for compound name standardizationgeneToChromBand.RData- Gene to chromosome band mapping annotationsHugoGeneSynonyms.RData- HUGO gene symbol synonyms for gene name standardization