# Detecting Potential Clinical Samples in Refine.bio Metadata

**Objective:** Identify human clinical RNA samples with treatment response labels from the Refine.bio metadata.

## Background

This notebook implements a regex-based detection pipeline to find columns in the Refine.bio metadata that likely contain clinical treatment response data. The metadata contains 6,700+ columns representing the union of all GEO experiment annotations, making manual inspection impractical.

## Approach

1. **Column name filtering:** Identify columns with names containing clinical response terms (e.g., "response", "remission", "PCR")
2. **Value analysis:** Check if column values match known response codes (e.g., CR, PR, SD, PD, PCR, RD)
3. **Binary classification:** Flag columns with binary response patterns (0/1, YES/NO)
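
The three steps can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's implementation: the column names, the reduced `NAME_PATTERN` and `KNOWN_CODES` lists, and the `screen_columns` helper are hypothetical stand-ins for the fuller definitions later in the notebook.

```python
import re

# Hypothetical, heavily reduced stand-ins for the notebook's term lists.
NAME_PATTERN = re.compile(r"(?i)(?:^|[^a-z0-9])(response|remission|pcr)(?=[^a-z0-9]|$)")
KNOWN_CODES = {"CR", "PR", "SD", "PD", "PCR", "RD"}
BINARY_VALUES = [{"0", "1"}, {"YES", "NO"}]


def screen_columns(table: dict[str, list]) -> dict[str, str]:
    """Return {column: reason} for columns that look like response labels."""
    flagged = {}
    for name, values in table.items():
        # Step 1: keep only columns whose name contains a response term.
        if not NAME_PATTERN.search(name):
            continue
        tokens = {str(v).strip().upper() for v in values if v is not None}
        if not tokens:
            continue
        # Step 3: binary patterns (checked first because they are a special case).
        if any(tokens <= allowed for allowed in BINARY_VALUES):
            flagged[name] = "binary"
        # Step 2: every value is a known response code.
        elif tokens <= KNOWN_CODES:
            flagged[name] = "response_codes"
    return flagged


toy = {
    "characteristics_ch1_pcr": ["1", "0", None, "1"],
    "characteristics_ch1_response": ["CR", "PD", "SD"],
    "characteristics_ch1_treatment_response_notes": ["see paper", "n/a"],
    "refinebio_age": ["54", "61"],
}
result = screen_columns(toy)
print(result)
```

The free-text notes column passes the name filter but is rejected by the value checks, which is why the pipeline looks at values as well as names.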

## Memory Efficiency

The metadata requires ~40GB RAM when fully loaded. This notebook provides two approaches:

- **Lazy scan (recommended):** Process columns one at a time using `pl.scan_parquet()` - works on any machine
- **In-memory scan (legacy):** Load entire dataset - requires high-memory machine

---

In [None]:
import polars as pl
import pandas as pd
import pyarrow.parquet as pq
from tqdm import tqdm
import re
from pathlib import Path
import numpy as np
from typing import Iterable, List, Tuple, Set


## Configuration and Paths

Set up file paths for the Refine.bio metadata and output files.

In [None]:
# Original path from development environment (not in this repo)
# root = Path("/mnt/hdd/jesse_archive/stampformer_archive/refine_bio/HOMO_SAPIENS")

# Paths for this repository
root = Path("../data")  # Relative to nbs/ directory

**Note:** Reading the full metadata TSV requires ~60 GB RAM. We use the Parquet version with lazy evaluation instead.

In [None]:
# Updated paths for this repository
data_dir = Path('../data')  # Relative to nbs/ directory

# Source TSV (NOT in this repo - original was 3GB)
tsv_meta_path = data_dir / 'metadata_HOMO_SAPIENS.tsv'  # ⚠️ Does not exist here

# Parquet file (✅ Available via Git LFS)
parquet_path = data_dir / "metadata_HOMO_SAPIENS.parquet"

# Schema file (if available)
schema_path = data_dir / "metadata_HOMO_SAPIENS_schema.json"

# Output files
potential_clin_path = data_dir / 'potential_clin_data.csv'

sep = '\t'

In [None]:
# Verify the parquet path is correct
print(f"Parquet path: {parquet_path}")
print(f"File exists: {parquet_path.exists()}")
print(f"File size: {parquet_path.stat().st_size / (1024**2):.1f} MB")

# Read the schema only (no row data is loaded)
schema = pl.read_parquet_schema(str(parquet_path))
print(f"\nTotal columns: {len(schema.names())}")
print(f"First 4 columns: {schema.names()[:4]}")

Parquet path: ..\data\metadata_HOMO_SAPIENS.parquet
File exists: True
File size: 68.8 MB

Total columns: 6791
First 4 columns: ['refinebio_accession_code', 'experiment_accession', 'refinebio_age', 'refinebio_cell_line']


## Examine the Parquet Metadata

This section explores the metadata structure and identifies columns with potential clinical response data.

In [None]:
# Get column names from schema (no data loading required)
columns = pl.read_parquet_schema(str(parquet_path)).names()
print(f"Total columns: {len(columns)}")
print(f"First 4 columns: {columns[:4]}")

Total columns: 6791
First 4 columns: ['refinebio_accession_code', 'experiment_accession', 'refinebio_age', 'refinebio_cell_line']


### Response Column Detection Functions

Define regex patterns and helper functions to identify columns likely containing treatment response data.

We used the list of response terms compiled by the CTR-DB team: `http://ctrdb.cloudna.cn/help`

In [None]:


# 1) Canonical response codes/labels (values)
RESPONSE_VALUE_TOKENS: Set[str] = {
    # atomic codes
    "PCR","RD","CR","NR","R","SD","PPR","PNC","PD","CRU","IR","PRCR","NON-CR","VGPR","NCR","PR","CRP",
    "MLFS","HI","RCB","CRI","OJBR","NOR","NC","ICCR","ICPR","ISD","ICPD","IUPD","IUPR","NONE","S","HR","TF","ED",
    # common long forms / keywords
    "PATHOLOGICAL COMPLETE REMISSION","RESIDUAL DISEASE","COMPLETE RESPONSE","NONRESPONSE","RESPONSE","STABLE DISEASE",
    "PATHOLOGICAL PARTIAL RESPONDERS","NON-RESPONDERS","PROGRESSIVE DISEASE","UNCONFIRMED COMPLETED REMISSION",
    "INCOMPLETE RESPONSE","LONG-HER","PARTIAL RESPONSE AND COMPLETE RESPONSE","NON-COMPLETE REMISSION",
    "VERY GOOD PARTIAL RESPONSE","GOOD/COMPLETE PCR","CR WITH INCOMPLETE HEMATOLOGIC OR PLATELET RECOVERY",
    "MORPHOLOGIC LEUKEMIA-FREE STATE","HEMATOLOGIC IMPROVEMENT","RESIDUAL CANCER BURDEN",
    "COMPLETE REMISSION WITH INCOMPLETE HEMATOLOGIC RECOVERY","OBJECTIVE RESPONDERS","NON-RESPONDERS","NO CHANGE",
    "CONFIRMED COMPLETE RESPONSE","CONFIRMED PARTIAL RESPONSE","CONFIRMED PROGRESSIVE DISEASE",
    "UNCONFIRMED PROGRESSIVE DISEASE","UNCONFIRMED PARTIAL RESPONSE","NO RESPONSE","SENSITIVE",
    "HIGHLY RESISTANT","TREATMENT FAILURE","EARLY DEATH","RESIDUAL","SURVIVAL","PROGRESSIVE",
    # generic binaries
    "YES","NO","Y","N","TRUE","FALSE","T","F","1","0"
}

# Atomic codes for regex matching (short codes that need word boundary matching)
ATOMIC_CODES: Set[str] = {
    "PCR","RD","CR","NR","R","SD","PPR","PNC","PD","CRU","IR","PRCR","NON-CR","VGPR","NCR","PR","CRP",
    "MLFS","HI","RCB","CRI","OJBR","NOR","NC","ICCR","ICPR","ISD","ICPD","IUPD","IUPR","NONE","S","HR","TF","ED"
}

# Long forms require exact match (already complete phrases)
LONG_FORMS: Set[str] = RESPONSE_VALUE_TOKENS - ATOMIC_CODES

# Compile regex pattern for atomic codes with word boundaries
# Pattern: (^|[^A-Z0-9])CODE([^A-Z0-9]|$)
# This matches 'PCR' in 'PCR 1', 'PCR:yes', 'pcr-positive' but not 'PCRTEST'
_ATOMIC_PATTERN = (
    r'(?:^|[^A-Z0-9])(' +
    '|'.join(re.escape(code) for code in ATOMIC_CODES) +
    r')(?:[^A-Z0-9]|$)'
)
ATOMIC_REGEX = re.compile(_ATOMIC_PATTERN, re.IGNORECASE)

# 2) Column-name terms (be specific; avoid single-letter codes here)
NAME_TERMS: Set[str] = {
    # core signals
    "RESPONSE","RESPONDER","RESPONDERS","RESPONDING","RESP","RESPONSES",
    "REMISSION","COMPLETE_REMISSION","NON_COMPLETE_REMISSION","NON-COMPLETE_REMISSION",
    "PROGRESSIVE_DISEASE","STABLE_DISEASE","PARTIAL_RESPONSE","COMPLETE_RESPONSE",
    "PROGRESSION","BURDEN","RESIDUAL","PATHOLOGICAL","PATH","RCB",
    # frequent clinical abbreviations safe for NAMES
    "PCR","CR","PR","SD","PD","VGPR","CRU","CRI","CRP","MLFS","HI","OJBR",
    "ICCR","ICPR","ISD","ICPD","IUPD","IUPR","NR","NONRESPONDER","NONRESPONDERS","NON_RESPONSE",
    "SENSITIVE","RESISTANT","RESISTANCE","TREATMENT_FAILURE","FAILURE","EARLY_DEATH","ED",
    # words that often wrap the above
    "THERAPEUTIC","THERAPY","TREATMENT","OUTCOME","EFFICACY","EFFECTIVENESS",
    "RESP_CAT","RESP_CLASS","RESP_STATUS","CLINICAL_BENEFIT","OBJECTIVE_RESPONSE",
    "RECIST","BOR"  # Best Overall Response
}

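def _compile_name_regex(terms: Iterable[str]) -> re.Pattern:
    """
    Build a case-insensitive column-name regex from NAME_TERMS.
    - Treats '-' and '_' inside a term as interchangeable separators
    - Requires non-alphanumeric boundaries to avoid matching substrings
    """
    esc = []
    for t in terms:
        # Split on separators, escape each part, then rejoin with a character class.
        # Chained str.replace() calls would rewrite the '_' inside the class
        # inserted by the first call and corrupt terms that contain a hyphen.
        parts = re.split(r"[-_]", t.strip().upper())
        esc.append("[-_]".join(re.escape(p) for p in parts))
    # word-ish boundaries: (^|[^A-Z0-9]) ... (?=[^A-Z0-9]|$)
    pattern = r"(?i)(^|[^A-Z0-9])(" + "|".join(esc) + r")(?=[^A-Z0-9]|$)"
    return re.compile(pattern)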

NAME_RE = _compile_name_regex(NAME_TERMS)

def normalize_token(x: str) -> str:
    """Uppercase, replace punctuation other than '/' and '-' with spaces, collapse whitespace."""
    x = (x or "").strip().upper()
    x = re.sub(r"[^A-Z0-9/\-\s]+", " ", x)
    x = re.sub(r"\s+", " ", x).strip()
    return x

def looks_like_response_values(
    tokens: Set[str],
    max_token_len: int = 40,
    use_regex: bool = True,
    coverage_threshold: float = 0.2
) -> Tuple[bool, float]:
    """
    Decide if a column's value set looks like treatment response labels.
    
    Uses `np.vectorize` for the per-token checks; this is a convenience wrapper
    around a Python-level loop rather than true array vectorization.

    Args:
        tokens: Set of unique string values from column (already normalized)
        max_token_len: Maximum token length to consider (filters free-text)
        use_regex: If True, atomic codes match with word boundaries (e.g., 'PCR 1', 'PCR:yes')
                   If False, require exact match
        coverage_threshold: Minimum fraction of tokens that must match (default 0.2)

    Returns:
        (is_response_like, coverage_ratio) where coverage is fraction of tokens
        within RESPONSE_VALUE_TOKENS
        
    Examples:
        >>> looks_like_response_values({"PCR", "RD"})
        (True, 1.0)
        >>> looks_like_response_values({"PCR 1", "RD 0"}, use_regex=True)
        (True, 1.0)
        >>> looks_like_response_values({"PCR 1", "RD 0"}, use_regex=False)
        (False, 0.0)
        >>> looks_like_response_values({"0", "1"})
        (True, 1.0)
    """
    if not tokens:
        return (False, 0.0)

    # Fast path: binary numeric (vectorized subset check)
    if tokens <= {"1", "0"}:
        return (True, 1.0)

    # Length guard to avoid free-text - vectorized using numpy
    tokens_arr = np.array(list(tokens), dtype=object)
    
    # Vectorized length computation
    get_len = np.vectorize(len, otypes=[int])
    token_lens = get_len(tokens_arr)
    short_mask = token_lens <= max_token_len

    if not short_mask.any():
        return (False, 0.0)

    short_tokens = tokens_arr[short_mask]
    n_short = len(short_tokens)

    # Build coverage check
    if use_regex:
        # Vectorized regex matching
        def matches_response(token: str) -> bool:
            # Check exact match against long forms first
            if token in LONG_FORMS:
                return True
            # Check regex match against atomic codes
            # Pattern catches 'PCR' in 'PCR 1', 'PCR:yes', etc.
            return ATOMIC_REGEX.search(token) is not None

        vectorized_match = np.vectorize(matches_response, otypes=[bool])
        covered_mask = vectorized_match(short_tokens)
        n_covered = covered_mask.sum()

    else:
        # Exact match only - vectorized set membership
        def exact_match(token: str) -> bool:
            return token in RESPONSE_VALUE_TOKENS

        vectorized_exact = np.vectorize(exact_match, otypes=[bool])
        covered_mask = vectorized_exact(short_tokens)
        n_covered = covered_mask.sum()

    # Cast NumPy scalars to built-in types so the result matches Tuple[bool, float]
    coverage = float(n_covered / n_short)
    is_response_like = bool(n_covered > 0 and coverage >= coverage_threshold)

    return (is_response_like, coverage)

def candidate_response_columns(all_columns: Iterable[str]) -> List[str]:
    """
    Filter column names by NAME_RE (specific clinical response terms only).
    """
    out = []
    for c in all_columns:
        if NAME_RE.search(c.upper()):
            out.append(c)
    return out

### Vectorized Response Detection (Optimized)

The `looks_like_response_values()` function has been **optimized with vectorization and regex matching**:

**Key Improvements:**

1. **Array-Based Operations**: Uses NumPy object arrays with `np.vectorize` instead of set comprehensions
   - Length filtering: `np.vectorize(len)` instead of `{t for t in tokens if len(t) <= max_token_len}`
   - Token matching: `np.vectorize`-wrapped regex/exact matching instead of set comprehensions

2. **Regex Word Boundary Matching** (controlled by `use_regex` flag):
   - **Atomic codes** (PCR, CR, RD, etc.) now match with word boundaries
   - Matches `'PCR 1'`, `'pcr:yes'`, `'cr-positive'` but NOT `'PCRTEST'`
   - Pattern: `(?:^|[^A-Z0-9])CODE(?:[^A-Z0-9]|$)` catches non-alphanumeric boundaries
   
3. **Two Matching Modes**:
   - `use_regex=True` (default): Fuzzy matching for atomic codes with word boundaries
   - `use_regex=False`: Exact string matching only
   
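4. **Configurable Coverage Threshold**: Set the minimum match ratio via `coverage_threshold` (default 0.2, i.e. 20%)

**Performance**: Reported as ~2-5x faster for large token sets (100+ unique values). Because `np.vectorize` still calls the Python function once per element, this figure should be re-measured before relying on it.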
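
To make the coverage threshold concrete, the sketch below recomputes the ratio with a plain set comprehension. `KNOWN` is a small hypothetical subset of the response tokens and `coverage` is an illustrative helper, not a function from this notebook; the 0.2 value is the default `coverage_threshold` in the function signature.

```python
KNOWN = {"CR", "PR", "SD", "PD"}  # hypothetical subset of the response tokens


def coverage(tokens: set[str], max_token_len: int = 40) -> float:
    """Fraction of short tokens that are known response labels."""
    # Long tokens are treated as free text and excluded from the denominator.
    short = {t for t in tokens if len(t) <= max_token_len}
    if not short:
        return 0.0
    return len(short & KNOWN) / len(short)


tokens = {"CR", "PD", "UNKNOWN", "NOT EVALUABLE", "PENDING"}
ratio = coverage(tokens)  # 2 matches out of 5 short tokens
print(ratio, ratio >= 0.2, ratio >= 0.6)  # 0.4 True False
```

A column like this one is flagged at a 0.2 threshold but not at 0.6, so the threshold directly controls how many partially matching columns reach the shortlist.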

**Example Usage**:
```python
# Exact match
looks_like_response_values({"PCR", "RD"}, use_regex=False)  
# → (True, 1.0)

# Regex match with boundaries
looks_like_response_values({"PCR 1", "RD 0"}, use_regex=True)  
# → (True, 1.0)

# Fails exact match but passes regex
looks_like_response_values({"PCR 1", "RD 0"}, use_regex=False)  
# → (False, 0.0)
```
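
The boundary behaviour can be checked in isolation with the standard `re` module. The sketch below rebuilds the same pattern shape over a hypothetical three-code subset (`codes`); it is meant only to show which strings the boundaries accept and reject.

```python
import re

codes = ["PCR", "RD", "CR"]  # hypothetical subset of the atomic codes
pattern = re.compile(
    r"(?:^|[^A-Z0-9])(" + "|".join(re.escape(c) for c in codes) + r")(?:[^A-Z0-9]|$)",
    re.IGNORECASE,
)

samples = ["PCR 1", "pcr:yes", "cr-positive", "PCRTEST", "RD0", "non-CR"]
hits = {s: bool(pattern.search(s)) for s in samples}
for s, hit in hits.items():
    print(f"{s!r:15} -> {hit}")
```

Two details are worth noting. Digits count as word characters here, so `'RD0'` does not match. And `'non-CR'` does match on the bare `CR` code, so a hit indicates that a value looks response-related, not whether the response was positive or negative.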

### Lazy Scan Approach (Memory-Efficient)

**Why lazy scanning?** The full metadata requires ~40GB RAM when loaded. Using `pl.scan_parquet()` allows us to process the data column-by-column without loading everything into memory.

We'll start with a simple test case before scaling to the full analysis.

### Full Column Profiling with Lazy Evaluation

Now scale up to analyze all candidate columns using the complete regex-based value detection.

In [None]:
def profile_column_lazy(col_name: str, parquet_path: Path, max_unique_sample: int = 100) -> dict:
    """
    Profile a single column using lazy evaluation to minimize memory usage.
    
    Args:
        col_name: Name of column to profile
        parquet_path: Path to parquet file
        max_unique_sample: Maximum number of unique values to sample for analysis
        
    Returns:
        Dictionary with column statistics and response classification
    """
    try:
        # Scan only the specific column
        lf = pl.scan_parquet(str(parquet_path)).select(col_name)
        
        # Get basic stats with lazy evaluation
        stats = (
            lf.select([
                pl.col(col_name).is_not_null().sum().alias("non_null_count"),
                pl.col(col_name).n_unique().alias("n_unique")
            ])
            .collect()
        )
        
        non_null = int(stats["non_null_count"][0])
        n_unique = int(stats["n_unique"][0])
        
        # Early exit if column is empty
        if non_null == 0:
            return {
                "colname": col_name,
                "dtype": "unknown",
                "n_non_null": 0,
                "n_unique": 0,
                "is_binary_numeric": False,
                "is_response_like": False,
                "coverage": 0.0,
                "tokens_sample": []
            }
        
        # Get unique value sample (limit to avoid memory issues)
        unique_values = (
            lf
            .select(pl.col(col_name))
            .unique()
            .limit(max_unique_sample)
            .collect()
        )
        
        # Get dtype from the collected sample
        dtype = str(unique_values[col_name].dtype)
        
        # Normalize tokens
        tokens = {
            normalize_token(v) 
            for v in unique_values[col_name].cast(pl.Utf8, strict=False).to_list() 
            if v is not None and normalize_token(str(v)) != ""
        }
        
        # Check if binary numeric
        is_binary = tokens.issubset({"0", "1"}) if tokens else False
        
        # Check if response-like
        is_resp, coverage = looks_like_response_values(tokens, use_regex=True)
        
        return {
            "colname": col_name,
            "dtype": dtype,
            "n_non_null": non_null,
            "n_unique": n_unique,
            "is_binary_numeric": is_binary,
            "is_response_like": is_resp,
            "coverage": coverage,
            "tokens_sample": sorted(list(tokens))[:20]
        }
        
    except Exception as e:
        print(f"Error processing {col_name}: {e}")
        return {
            "colname": col_name,
            "dtype": "error",
            "n_non_null": 0,
            "n_unique": 0,
            "is_binary_numeric": False,
            "is_response_like": False,
            "coverage": 0.0,
            "tokens_sample": [],
            "error": str(e)
        }



In [None]:

# Get all candidate columns
all_candidates = candidate_response_columns(columns)
print(f"Profiling {len(all_candidates)} candidate response columns using lazy evaluation...")
print("This approach uses minimal memory by processing one column at a time.\n")

# Profile each column with lazy evaluation
rows = []
for col_name in tqdm(all_candidates, desc="Profiling columns"):
    result = profile_column_lazy(col_name, parquet_path)
    rows.append(result)

# Create results dataframe
resp_profile_lazy = pl.DataFrame(rows).sort(
    ["is_response_like", "coverage", "n_non_null"], 
    descending=[True, True, True]
)

# Filter to shortlist
shortlist_lazy = resp_profile_lazy.filter(
    (pl.col("is_response_like") == True) | (pl.col("is_binary_numeric") == True)
)

print(f"\nResults:")
print(f"  Total candidates profiled: {len(resp_profile_lazy)}")
print(f"  Response-like columns found: {len(shortlist_lazy)}")
print(f"  Total non-null entries: {shortlist_lazy['n_non_null'].sum():,}")

Profiling 414 candidate response columns using lazy evaluation...
This approach uses minimal memory by processing one column at a time.



Profiling columns: 100%|██████████| 414/414 [01:51<00:00,  3.72it/s]


Results:
  Total candidates profiled: 414
  Response-like columns found: 154
  Total non-null entries: 17,529





In [None]:
# Display the shortlist results
print("Top 20 response-like columns by coverage and sample count:\n")
print(shortlist_lazy.head(20))

# Optionally save results
shortlist_lazy.write_parquet(data_dir / "response_column_shortlist_lazy.parquet")
print(f"\nSaved results to:")
print(f"  - response_column_shortlist_lazy.parquet")

Top 20 response-like columns by coverage and sample count:

shape: (20, 8)
┌─────────────┬─────────┬────────────┬──────────┬─────────────┬────────────┬──────────┬────────────┐
│ colname     ┆ dtype   ┆ n_non_null ┆ n_unique ┆ is_binary_n ┆ is_respons ┆ coverage ┆ tokens_sam │
│ ---         ┆ ---     ┆ ---        ┆ ---      ┆ umeric      ┆ e_like     ┆ ---      ┆ ple        │
│ str         ┆ str     ┆ i64        ┆ i64      ┆ ---         ┆ ---        ┆ f64      ┆ ---        │
│             ┆         ┆            ┆          ┆ bool        ┆ i64        ┆          ┆ list[str]  │
╞═════════════╪═════════╪════════════╪══════════╪═════════════╪════════════╪══════════╪════════════╡
│ characteris ┆ Int64   ┆ 941        ┆ 3        ┆ true        ┆ 1          ┆ 1.0      ┆ ["0", "1"] │
│ tics_ch1_pr ┆         ┆            ┆          ┆             ┆            ┆          ┆            │
│ ihc         ┆         ┆            ┆          ┆             ┆            ┆          ┆            │
│ characteris ┆ 

## Export Potential Clinical Samples

Create a filtered subset containing only samples with non-null response-related columns.

### Export Strategy

Create the following outputs:

1. Subset of metadata with only response-related columns populated (includes GSE and GSM identifiers)
2. List of valuable response columns identified
3. Unique experiments for manual validation
4. Experiment-level metadata for context
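
The row and column filtering behind outputs 1 and 3 can be sketched in plain Python. The records, accession IDs, and column names below are made up for illustration; the notebook itself performs the equivalent steps with Polars expressions.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"sample": "GSM1", "experiment": "GSE1", "resp_a": "CR", "resp_b": None, "unused": None},
    {"sample": "GSM2", "experiment": "GSE1", "resp_a": None, "resp_b": None, "unused": None},
    {"sample": "GSM3", "experiment": "GSE2", "resp_a": None, "resp_b": "1", "unused": None},
]
response_cols = ["resp_a", "resp_b"]

# Keep rows where ANY response column is populated.
kept = [r for r in rows if any(r[c] is not None for c in response_cols)]

# Drop columns that are null in every kept row.
live_cols = [c for c in rows[0] if any(r[c] is not None for r in kept)]
subset = [{c: r[c] for c in live_cols} for r in kept]

# Unique experiments to hand off for manual validation.
experiments = sorted({r["experiment"] for r in subset})
print(len(subset), live_cols, experiments)
```

Dropping all-null columns after the row filter is what shrinks the wide metadata table to a manageable width, since most annotation columns belong to experiments that were filtered out.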

In [None]:
# Create subset using lazy evaluation to minimize memory usage:
# only rows where at least one response column is non-null are materialized

# Use the lazy-scanned shortlist results
shortlist_cols = shortlist_lazy['colname'].to_list()

print(f"Loading data to filter {len(shortlist_cols)} response columns...")

# Lazy scan: nothing is read until .collect()
lf = pl.scan_parquet(str(parquet_path))

# Only include columns that actually exist in the data.
# collect_schema() (Polars >= 1.0) resolves the schema explicitly and avoids
# the PerformanceWarning triggered by accessing LazyFrame.schema.
schema_names = set(lf.collect_schema().names())
existing_cols = [c for c in shortlist_cols if c in schema_names]
print(f"Found {len(existing_cols)} shortlist columns in dataset")

# Build filter expression
if existing_cols:
    mask_expr = pl.any_horizontal([pl.col(c).is_not_null() for c in existing_cols])
    
    # Apply filter and collect
    subset = lf.filter(mask_expr).collect()
    
    # Drop completely null columns
    subset = subset.select([c for c in subset.columns if not subset[c].is_null().all()])
    
    print(f"\nFiltered dataset shape: {subset.shape}")
else:
    print("⚠️ No matching columns found!")
    subset = None

Loading data to filter 154 response columns...


Found 154 shortlist columns in dataset

Filtered dataset shape: (12100, 1063)


In [None]:
subset['experiment_accession'].str.starts_with('GSE').sum(), len(subset)

(12100, 12100)

In [None]:
# Load existing clinical observations for comparison
clin_obs = pd.read_csv(data_dir / 'clin_obs.csv')

clin_obs

Unnamed: 0,data_cancer_name,dataset_name,depth,drug_ids,drug_list,is_microarray,label,num_expressed,num_measured,primary_tissue,sample_id,tcga_subtype,drug_list_canonized
0,Breast cancer,CTR_Microarray_1-I,0.0,"[3874, 2882, 105, 6792, 2790, 0, 0, 0]","[""Doxorubicin"", ""Paclitaxel"", ""Cyclophosphamid...",True,0,19068,19068,Breast,GSM1233067,BRCA,"['DOXORUBICIN', 'PACLITAXEL', 'CYCLOPHOSPHAMID..."
1,Breast cancer,CTR_Microarray_1-I,0.0,"[3874, 2882, 105, 6792, 2790, 0, 0, 0]","[""Doxorubicin"", ""Paclitaxel"", ""Cyclophosphamid...",True,0,19068,19068,Breast,GSM1233069,BRCA,"['DOXORUBICIN', 'PACLITAXEL', 'CYCLOPHOSPHAMID..."
2,Breast cancer,CTR_Microarray_1-I,0.0,"[3874, 2882, 105, 6792, 2790, 0, 0, 0]","[""Doxorubicin"", ""Paclitaxel"", ""Cyclophosphamid...",True,0,19068,19068,Breast,GSM1233072,BRCA,"['DOXORUBICIN', 'PACLITAXEL', 'CYCLOPHOSPHAMID..."
3,Breast cancer,CTR_Microarray_1-I,0.0,"[3874, 2882, 105, 6792, 2790, 0, 0, 0]","[""Doxorubicin"", ""Paclitaxel"", ""Cyclophosphamid...",True,1,19068,19068,Breast,GSM1233085,BRCA,"['DOXORUBICIN', 'PACLITAXEL', 'CYCLOPHOSPHAMID..."
4,Breast cancer,CTR_Microarray_1-I,0.0,"[3874, 2882, 105, 6792, 2790, 0, 0, 0]","[""Doxorubicin"", ""Paclitaxel"", ""Cyclophosphamid...",True,0,19068,19068,Breast,GSM1233086,BRCA,"['DOXORUBICIN', 'PACLITAXEL', 'CYCLOPHOSPHAMID..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12375,,ipsy2_genexp,0.0,"[2882, 6863, 6801, 0, 0, 0, 0, 0]","[""Paclitaxel"", ""Pertuzumab"", ""Trastuzumab""]",True,1,7151,7151,,GSM5859964,BRCA,"['PACLITAXEL', 'PERTUZUMAB', 'TRASTUZUMAB']"
12376,,ipsy2_genexp,0.0,"[2882, 3690, 0, 0, 0, 0, 0, 0]","[""Paclitaxel"", ""Neratinib""]",True,0,7151,7151,,GSM5859965,BRCA,"['PACLITAXEL', 'NERATINIB']"
12377,,ipsy2_genexp,0.0,"[2882, 6804, 0, 0, 0, 0, 0, 0]","[""Paclitaxel"", ""Pembrolizumab""]",True,1,7151,7151,,GSM5859966,BRCA,"['PACLITAXEL', 'PEMBROLIZUMAB']"
12378,,ipsy2_genexp,0.0,"[2882, 6878, 3838, 0, 0, 0, 0, 0]","[""Paclitaxel"", ""ABT 888"", ""Carboplatin""]",True,1,7151,7151,,GSM5859967,BRCA,"['PACLITAXEL', 'ABT-888', 'CARBOPLATIN']"


In [None]:
# Compare against existing clinical data to identify novel samples
# ASSUMPTION: 'refinebio_accession_code' holds GSM sample IDs comparable to
# clin_obs['sample_id']; rb_ids and clin_ids were not defined in earlier cells.
if subset is not None and len(subset) > 0:
    rb_ids = set(subset['refinebio_accession_code'].to_list())
    clin_ids = set(clin_obs['sample_id'])
    # Set difference avoids scanning clin_ids once per sample
    not_in_clin = len(rb_ids - clin_ids)

    print(f"Potential clinical samples identified: {len(rb_ids):,}")
    print(f"Already in existing clinical dataset: {len(rb_ids) - not_in_clin:,}")
    print(f"Novel samples (not in clin_obs.csv): {not_in_clin:,}")
    print(f"\nPercentage of novel samples: {100 * not_in_clin / len(rb_ids):.1f}%")
else:
    print("⚠️ Subset not available for comparison")


cols = ['source_name_ch1', 'title', 'treatment_protocol_ch1', 'refinebio_title', 'refinebio_treatment', 'refinebio_specimen_part']

`refinebio_cell_line`, `refinebio_disease`, `refinebio_disease_stage`, `refinebio_organism`, `refinebio_age`, `refinebio_accession_code`, `experiment_accession`

In [None]:
# Save the filtered subset to CSV
if subset is not None:
    subset.to_pandas().to_csv(potential_clin_path, index=False)
    print(f"✅ Saved {len(subset):,} samples to: {potential_clin_path}")
else:
    print("⚠️ No subset to save")

✅ Saved 12,100 samples to: ..\data\potential_clin_data.csv


In [None]:
# Display the output path
print(f"Output file: {potential_clin_path}")
print(f"File exists: {potential_clin_path.exists()}")
if potential_clin_path.exists():
    print(f"File size: {potential_clin_path.stat().st_size / (1024**2):.1f} MB")

Output file: ..\data\potential_clin_data.csv
File exists: True
File size: 31.2 MB


## Summary and Next Steps

### Results
- **Total response-like columns identified:** 154
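- **Potential clinical samples:** 12,100 (rows in the filtered subset saved to `potential_clin_data.csv`)
- **Novel samples (not in existing dataset):** 8,903 (from an earlier run that identified 11,713 potential samples; not recomputed in this run)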

### Next Steps
1. **Manual validation** (`nbs/verify_clin.ipynb`): Inspect samples to confirm:
   - Real patients (not cell lines)
   - Valid response labels
   - Consistent experiment metadata

2. **Label standardization**: Map diverse response terms to binary outcomes (0 = non-response, 1 = response)

3. **Integration**: Merge validated samples with existing clinical dataset
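
For step 2, a label-standardization pass might look like the sketch below. The `POSITIVE` and `NEGATIVE` groupings and the `to_binary` helper are assumptions made for illustration, not a mapping defined in this project; in particular, whether stable disease (SD) counts as non-response depends on the study endpoint.

```python
from typing import Optional

# ASSUMPTION: one possible convention, chosen only for this example.
POSITIVE = {"CR", "PR", "PCR", "R", "YES", "1"}
NEGATIVE = {"SD", "PD", "RD", "NR", "NO", "0"}


def to_binary(label: Optional[str]) -> Optional[int]:
    """Map a raw response label to 1/0, or None when it is not recognized."""
    if label is None:
        return None
    token = label.strip().upper()
    if token in POSITIVE:
        return 1
    if token in NEGATIVE:
        return 0
    return None  # leave ambiguous labels for manual review


raw = ["pCR", "RD", " cr ", "Stable disease", None]
binary = [to_binary(x) for x in raw]
print(binary)  # [1, 0, 1, None, None]
```

Returning `None` for unrecognized labels, rather than guessing, keeps ambiguous values visible for the manual validation in step 1.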

### Files Generated
- `potential_clin_data.csv` - Candidate samples with response columns
- `response_column_shortlist_lazy.parquet` - 154 shortlisted response columns
- `response_column_profile.parquet` - Full analysis of 414 candidate columns