Skip to content

MilitPatel/cellminercdb_pandasai

Repository files navigation

LLM CellMinerCDB PandasAI

Python 3.10+ Streamlit PandasAI PostgreSQL Azure OpenAI

AI-powered platform for analyzing ~280M cancer cell line data points through natural language queries. Built with PandasAI 3.0, Azure OpenAI, and R-based CellMiner data processing.

Overview

PostgreSQL-powered infrastructure for real-time analysis of cancer genomics datasets. Combines CellMinerCDB data with AI-powered natural language processing to query complex multi-omics data without SQL.

Key Features

  • Data Import: R-based extraction and PostgreSQL bulk loading of 280M+ rows
  • Natural Language Queries: GPT-4 powered analysis via PandasAI 3.0
  • High-Performance Database: Optimized PostgreSQL with intelligent indexing
  • Cell Line Harmonization: FlashText-powered synonym matching across databases
  • Export: Publication-ready SVG charts with downloadable code

System Architecture

Core Files

llm_cellminercdb_pandasai/
β”œβ”€β”€ pandasai_helper.py              # PostgreSQL/PandasAI integration
β”œβ”€β”€ streamlit_pandasai_chatbot.py   # Main Streamlit application
β”œβ”€β”€ local.r                         # R data extraction & upload
β”œβ”€β”€ uploader_optimized.r            # R high-performance transfer
β”œβ”€β”€ pyproject.toml                  # Python dependencies
β”œβ”€β”€ env.example                     # Environment template
β”œβ”€β”€ rcellminerUtilsCDB/             # R utilities and data files
β”œβ”€β”€ output_csvs/                    # CSV exports
β”œβ”€β”€ sql_outputs/                    # Generated SQL queries
└── logs/                           # System logs

Data Model

Table: cellminer_combined_data (~280M rows)

Column Type Description Example Values
data_type TEXT Measurement type 'exp', 'mut', 'cop', 'act'
dataset TEXT Source database 'CCLE', 'GDSC'
row_name TEXT Gene/drug identifier 'TP53', 'Cisplatin'
col_name TEXT Cell line name 'HeLa', 'MCF7'
value DOUBLE PRECISION Measurement value 1.23, -2.45
cell_line_tissue TEXT Cancer tissue type 'breast', 'lung', 'colon'
drug_moa TEXT Drug mechanism 'DNA synthesis inhibitor'
drug_synonyms TEXT Alternative drug names `'cisplatin
cell_line_synonyms TEXT Cell line identifiers `'CVCL_0033

Data Types (examples):

  • exp: Gene expression (log2-transformed)
  • mut: Mutation status (binary/VAF)
  • cop: Copy number variation (log2 ratio)
  • act: Drug activity (IC50 values in micromolar)

Dataset Download

Download required R data packages from: https://zenodo.org/records/15122311

Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo.

R Utilities and Data Files

The rcellminerUtilsCDB/ directory contains (https://github.com/CBIIT/rcellminerUtilsCDB):

  • cellLineMatchTab.RData - Cell line matching data
  • drugSynonymTab.RData - Drug synonym mapping data
  • geneToChromBand.RData - Gene to chromosome band mapping
  • HugoGeneSynonyms.RData - HUGO gene symbol synonyms
  • Aug8_nsc_cid_smiles.csv - Chemical structure data (custom file, not included in directory)

Additional Configuration Files

  • pyproject.toml - Python project configuration and dependencies
  • requirements.txt - Python package requirements
  • uv.lock - UV package manager lock file
  • LICENSE - Project license file
  • prompts.txt - PandasAI prompt templates

Output Directories

  • output_csvs/ - Contains exported CSV files for cell line and drug synonyms
  • sql_outputs/ - Stores generated SQL query files with timestamps
  • logs/ - System logs directory (auto-created)

Dataset Directories

The project includes several R package datasets that contain cancer genomics data:

  • ccleData/
  • gdscData/
  • ctrpData/
  • mdaMillsData/
  • nciSarcomaData/
  • uniSarcomaData/

πŸ“₯ Dataset Download: Download the required R data packages from the official CellMinerCDB Zenodo repository: https://zenodo.org/records/15122311

Dataset Source: Luna, A., Elloumi, F., & Rajapakse, V. (2025). CellMiner Cross-Database (CellMinerCDB) Cancer Pharmacogenomics. Zenodo. https://doi.org/10.5281/zenodo.15122311


Installation & Setup

Prerequisites

  • Python 3.10+
  • R 4.0+ (for data extraction)
  • Azure OpenAI account with GPT-4 deployment
  • 8GB+ RAM (16GB recommended for large imports)
  • 80GB+ disk space (for PostgreSQL data + indexes)

PostgreSQL Setup:

# macOS
brew install postgresql@14
brew services start postgresql@14

# Ubuntu/Debian
sudo apt-get install postgresql postgresql-contrib

R Setup:

R -e "install.packages(c('Biobase', 'data.table', 'RPostgres', 'DBI'), dependencies=TRUE)"

Installation

git clone https://github.com/sciluna/llm_cellminercdb_pandasai.git
cd llm_cellminercdb_pandasai

# Install dependencies
uv venv .venv
source .venv/bin/activate
uv pip install -e .

Environment Configuration

Create .env file (copy from env.example):

AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview

DBEAVER_URL=postgresql://host:port
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password

PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1

Usage

Step 1: Dataset Setup

Download CellMinerCDB Datasets: Before running any data processing scripts, download the required cancer pharmacogenomics datasets from the official Zenodo repository:

https://zenodo.org/records/15122311

Important: This project uses a zenodo-only approach - download the R packages and extract the RData files directly. No external R packages like rcellminer are required. The data extraction scripts access the data structures directly using Biobase functions.

Download and extract the R packages as needed for your analysis. The datasets include:

  • ccleData
  • gdscDataDec15
  • ctrpData
  • nciSarcomaData
  • uniSarcomaData
  • mdaMillsData

Data Processing Overview

The platform uses R-based data processing to extract and upload CellMiner data:

R-based Processing Pipeline (local.r + uploader_optimized.r)

  • Input: CellMiner RData files from R packages (gdscData, ccleData, ctrpData, etc.)
  • Process: Extracts data directly from R objects, performs canonical ID mapping
  • Output: PostgreSQL tables with enriched metadata and optimized indexes
  • Benefits: Direct access to source data structures, comprehensive metadata integration

Step 2: Database Preparation

Verify PostgreSQL:

# Check PostgreSQL status
brew services list | grep postgresql
psql -h localhost -U $USER -d postgres -c "SELECT version();"

Import Data (R-based Setup):

# Extract and upload CellMiner data to local PostgreSQL
Rscript local.r                    # Extract and upload data
Rscript uploader_optimized.r       # Transfer to remote DB (optional)

3. Launch Application

Option A: Using uv (Recommended for Development)

uv run streamlit run streamlit_pandasai_chatbot.py

Option B: Using Docker (Recommended for Production)

# Copy environment file and configure
cp env.example .env
# Edit .env with your credentials

# Build and run with Docker Compose
docker-compose up --build

# Or run with just the app (using external database)
docker-compose up streamlit-app

# Or run with local PostgreSQL database
docker-compose --profile with-db up --build

Option C: Using Docker directly

# Build the image
docker build -t cellminer-app .

# Run the container
docker run -p 8501:8501 \
  --env-file .env \
  -v $(pwd)/logs:/app/logs \
  -v $(pwd)/exports:/app/exports \
  cellminer-app

Navigate to: http://localhost:8501

Docker Deployment

Features

  • Multi-stage build: Optimized for production with minimal image size
  • Security: Runs as non-root user
  • Persistence: Volume mounts for logs, exports, and outputs
  • Health checks: Built-in health monitoring
  • Environment variables: Configurable via .env file

Environment Variables

Required environment variables (set in .env file):

# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview

# Database Configuration
DBEAVER_URL=postgresql://host:port/database
DBEAVER_DATABASE=postgres
DBEAVER_USERNAME=your_username
DBEAVER_PASSWORD=your_password

# PandasAI Security Configuration
PANDASAI_BYPASS_SECURITY=true
PANDASAI_MAX_RETRIES=1

Docker Commands

# Development with auto-rebuild
docker-compose up --build

# Production deployment
docker-compose up -d

# View logs
docker-compose logs -f streamlit-app

# Stop services
docker-compose down

Scripts Documentation

Core Scripts

streamlit_pandasai_chatbot.py

Main Streamlit web application providing AI-powered cancer data analysis interface.

R Package Dependencies:

# Install missing Bioconductor packages
R -e "if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
R -e "BiocManager::install(c('Biobase'))"

# Install CRAN packages
R -e "install.packages(c('RPostgres', 'DBI', 'data.table'), dependencies=TRUE)"

πŸ“„ Scripts Documentation

This section documents all scripts and utilities currently tracked in the repository and available for use.

πŸ“‹ Complete Script Inventory

Core Application Scripts:

  • streamlit_pandasai_chatbot.py - Main AI-powered web application for cancer data analysis

Data Processing Utilities:

  • pandasai_helper.py - Database integration and data processing module
  • local.r - R script for extracting and uploading CellMiner data to local PostgreSQL
  • uploader_optimized.r - R script for high-performance data transfer from local to remote PostgreSQL

Configuration Files:

  • pyproject.toml - Python project configuration and dependencies
  • requirements.txt - Python package requirements (pip format)
  • env.example - Environment variables template
  • LICENSE - Project licensing terms
  • .streamlit/config.toml - Streamlit application configuration

Core Application Scripts

streamlit_pandasai_chatbot.py - Main Web Application

  • Purpose: Primary Streamlit web interface for AI-powered cancer data analysis
  • What it does:
    • Provides natural language query interface using PandasAI 3.0
    • Connects to PostgreSQL database with ~280M cancer genomics data points
    • Handles Azure OpenAI integration for query processing
    • Generates publication-ready SVG charts with downloadable code
    • Manages cell line synonym resolution and data harmonization
    • Includes comprehensive error handling and connection validation
  • How to run:
    source .venv/bin/activate
    streamlit run streamlit_pandasai_chatbot.py
  • Dependencies: Requires .env file with Azure OpenAI and PostgreSQL credentials
  • Access: Navigate to http://localhost:8501 after launch

Data Processing Utilities

pandasai_helper.py - Core Database Integration Module

  • Purpose: Central module providing database connectivity and data processing utilities
  • What it does:
    • Manages PostgreSQL connections using environment variables [[memory:4180646]]
    • Creates and configures PandasAI datasets for AI analysis
    • Provides database connectivity and query optimization utilities
    • Provides cell line synonym resolution using FlashText algorithm
    • Implements data validation and integrity checking
    • Manages connection pooling and transaction handling
    • Provides logging with rotating file handlers
  • Key Functions:
    • create_postgres_dataset() - Dataset creation and management
    • build_cell_line_keyword_processor() - Synonym resolution setup
    • configure_postgres_for_bulk_import() - Database optimization
    • sanitize_column_names() - Data preparation utilities
  • Usage: Imported by other scripts and the main application
  • Environment: Uses DBEAVER_* credentials exclusively [[memory:4180646]]

local.r - Local Data Extraction and Database Upload

  • Purpose: R script for extracting CellMiner data from RData files and uploading to local PostgreSQL database
  • What it does:
    • Extracts molecular data (exp, mut, cop, mda, xsq, mir, pro, mtb, rrb, met, muf, mth, smt, var, fus, bmt) from multiple datasets
    • Extracts drug activity data (act) using getAct() function
    • Processes data from GDSC, CCLE, NCISARCOMA, UNISARCOMA, MDAMILLS, and CTRP datasets
    • Creates canonical ID mappings for drugs (PubChem CID) and cell lines (Cellosaurus Accession)
    • Enriches data with tissue information from OncoTree1 and mechanism of action (MOA) metadata
    • Performs bulk upload to PostgreSQL with optimized chunking and transaction handling
    • Creates comprehensive database indexes for fast querying
  • Configuration:
    # User configuration section
    base_path <- "/path/to/your/project"
    skip_datasets <- c()  # For fresh run, or c("GDSC", "CCLE") to resume
    
    # Database configuration
    db_host <- "localhost"
    db_port <- 5432
    db_name <- "cellminerdb_new"
    db_user <- "your_username"
    db_pass <- "your_password"
  • How to run:
    source .venv/bin/activate
    Rscript local.r
  • Output: Creates cellminer_data_new table with ~280M rows including canonical identifiers and metadata
  • Data Safety: Enforces that the cellminer_data_env table is never edited or deleted [[memory:6245503]]

uploader_optimized.r - High-Performance Database Transfer

  • Purpose: Optimized R script for transferring data from local PostgreSQL to remote Azure PostgreSQL
  • What it does:
    • Transfers complete cellminer_data_new table from local to remote PostgreSQL database
    • Uses advanced PostgreSQL COPY command for maximum transfer speed
    • Implements key-based pagination instead of OFFSET for efficient reading
    • Provides comprehensive data integrity verification between source and destination
    • Supports resume capability with checkpoint tracking for interrupted transfers
    • Creates optimized indexes on remote database for query performance
  • Safety Features:
    • Source Protection: Multiple safety checks prevent accidental modification of local database
    • Host Validation: Ensures remote database is not localhost to prevent data loss
    • Connection Testing: Validates both local and remote connections before transfer
    • Integrity Verification: 6-step verification process comparing row counts, distributions, and statistics
  • Configuration:
    # Performance settings
    use_copy_method <- TRUE        # Use COPY instead of INSERT
    use_parallel <- FALSE          # Enable parallel processing
    larger_chunks <- TRUE          # Use larger chunk sizes
    fresh_run <- TRUE             # Set to FALSE for resume capability
    
    # Local database (source)
    local_db_host <- "localhost"
    local_db_name <- "cellminerdb_new"
    local_db_user <- "your_username"
    
    # Remote database (destination)
    remote_db_host <- "your-azure-host.postgres.database.azure.com"
    remote_db_name <- "postgres"
    remote_db_user <- "azure_username"
    remote_db_pass <- "azure_password"
  • How to run:
    source .venv/bin/activate
    Rscript uploader_optimized.r

Configuration Files

pyproject.toml - Python Project Configuration

  • Purpose: Python project metadata and dependency management
  • What it contains:
    • Project name, version, and description
    • Python version requirement (3.11.*)
    • Complete dependency list with pinned versions
    • Build system configuration for packaging
  • Usage: Used by uv pip install -e . and uv for dependency management

requirements.txt - Python Dependencies

  • Purpose: Pip-compatible dependency specification
  • What it contains: Pinned versions of all Python packages required
  • Usage: uv pip install -r requirements.txt (alternative to pyproject.toml)

env.example - Environment Variables Template

  • Purpose: Template for required environment variables
  • What it contains:
    • Azure OpenAI configuration (endpoint, API key, deployment name, API version)
    • PostgreSQL connection details (URL, database, username, password) [[memory:4180646]]
    • PandasAI security settings (bypass security, max retries)
  • How to use:
    cp env.example .env
    # Edit .env with your actual credentials

LICENSE - Project License

  • Purpose: Defines the legal terms under which the project can be used, modified, and distributed
  • Usage: Reference for understanding project licensing terms and obligations

.streamlit/config.toml - Streamlit Configuration

  • Purpose: Configures Streamlit application settings and UI preferences
  • What it contains: Application theme, server settings, and display options
  • Usage: Automatically loaded by Streamlit on application startup

Data Resources

rcellminerUtilsCDB/ Directory

R utility data files for CellMiner integration:

  • Aug8_nsc_cid_smiles.csv - Chemical structure identifiers and SMILES notation for drug compounds
  • cellLineMatchTab.RData - Cell line harmonization data for cross-database matching
  • drugSynonymTab.RData - Drug synonym mappings for compound name standardization
  • geneToChromBand.RData - Gene to chromosome band mapping annotations
  • HugoGeneSynonyms.RData - HUGO gene symbol synonyms for gene name standardization

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published