### 📘 How to Use This Notebook with GEO Datasets (Manual Download)

To run CNV analysis using **CopyKAT**, you will need to manually download gene expression data from **NCBI GEO (Gene Expression Omnibus)**.

#### 📥 Manual Download Instructions

1. Visit **[https://www.ncbi.nlm.nih.gov/geo/](https://www.ncbi.nlm.nih.gov/geo/)**  
2. Search for your dataset of interest (e.g., `GSE178341`)
3. Go to the **“Supplementary files”** section
4. Download expression files such as:
   - `.h5` (10X Genomics format)
   - `.h5ad` (AnnData)
   - `.mtx`, `barcodes.tsv`, and `features.tsv`
   - or `.csv` if available

After downloading, upload the files using the button below to begin preprocessing.
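
GEO supplementary files are often delivered as `.gz` files or bundled in a `.tar` archive, so they may need unpacking before upload. The helper below is a minimal sketch of that step using only the standard library; the function name `unpack_geo_download` and the flat output layout are illustrative assumptions, not part of this notebook's pipeline.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_geo_download(path, out_dir="."):
    """Unpack a tar archive or decompress a .gz file; return the resulting paths."""
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if tarfile.is_tarfile(path):
        extracted = []
        with tarfile.open(path) as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                # Keep only the base name so every file lands directly in out_dir.
                target = out_dir / Path(member.name).name
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
        return extracted

    if path.suffix == ".gz":
        target = out_dir / path.stem  # e.g. matrix.mtx.gz -> matrix.mtx
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return [target]

    return [path]  # nothing to unpack
```

Files inside a `.tar` archive may themselves be gzipped, in which case the function can be called again on each returned path.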

---

### 🔗 Dataset Selection Tips (Immunova Integration)

This notebook is designed to support **Immunova's 4-module pipeline**, especially the **Treatment Response** and **Survival Analysis** modules.

To ensure compatibility:

- ✅ Select **human tumor datasets** with single-cell resolution  
- ✅ Prefer datasets from **immunotherapy-treated patients** or datasets that include **TME (tumor microenvironment)** cells  
- ✅ Include both **tumor cells** and normal cells such as **T cells, B cells, or macrophages**, because CopyKAT uses diploid normal cells as the baseline for separating aneuploid tumor cells

#### Examples:
- `GSE178341`: Pan-cancer scRNA-seq data (colorectal focus)
- `GSE150728`, `GSE120575`: Melanoma or lung tumor microenvironment datasets

Check each GEO record to confirm the tissue, disease, and treatment before relying on an accession.



# 🧬 CopyKAT Preprocessing Notebook (Multi-format Compatible)
This notebook allows you to upload **any common single-cell RNA-seq expression file format** and convert it to a `gene × cell` matrix that can be used with the **CopyKAT CNV analysis tool in R**.

---

## ✅ Supported Input Formats
- `.h5ad` – AnnData HDF5 format
- `.h5` – 10x Genomics HDF5 format
- `.mtx` + `barcodes.tsv` + `features.tsv` – Sparse Matrix + Metadata
- `.csv` – Generic gene expression table (either genes × cells or cells × genes)

Each format is auto-detected from the file extension, loaded into an AnnData object, and exported to `adata_for_copykat.csv` (genes as rows, cells as columns) for use in CopyKAT.
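
The detection rule can be sketched as a small pure-Python function. It follows the same priority order as Step 3 of this notebook (`.h5ad`, then `.h5`, then `.mtx`, then `.csv`). The name `detect_format`, the returned labels, and the extra check for companion barcode and feature files are assumptions added for illustration.

```python
def detect_format(file_names):
    """Return 'h5ad', '10x_h5', '10x_mtx', or 'csv' for a set of uploaded file names."""
    names = [name.lower() for name in file_names]

    def has(*suffixes):
        # str.endswith accepts a tuple of candidate suffixes.
        return any(name.endswith(suffixes) for name in names)

    if has(".h5ad"):
        return "h5ad"
    if has(".h5"):
        return "10x_h5"
    if has(".mtx", ".mtx.gz"):
        # The sparse matrix is only usable together with its barcode and feature files.
        has_barcodes = any("barcodes.tsv" in name for name in names)
        has_features = any(
            "features.tsv" in name or "genes.tsv" in name for name in names
        )
        if not (has_barcodes and has_features):
            raise ValueError("Found .mtx but barcodes/features files are missing.")
        return "10x_mtx"
    if has(".csv"):
        return "csv"
    raise ValueError("Unsupported or incomplete file set.")
```

For example, `detect_format(["matrix.mtx", "barcodes.tsv", "features.tsv"])` returns `"10x_mtx"`, while a lone `matrix.mtx` raises a `ValueError` instead of failing later during loading.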


### 🧩 Step 1: Install Required Packages

In [None]:
!pip install scanpy anndata scipy

### 📥 Step 2: Upload Expression Files

In [None]:
from google.colab import files
uploaded = files.upload()

### 🔍 Step 3: Auto-detect Format and Load Data

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np

file_names = list(uploaded.keys())
adata = None

if any(f.endswith('.h5ad') for f in file_names):
    f = [x for x in file_names if x.endswith('.h5ad')][0]
    print(f"📘 Detected .h5ad file: {f}")
    adata = sc.read_h5ad(f)

elif any(f.endswith('.h5') for f in file_names):
    f = [x for x in file_names if x.endswith('.h5')][0]
    print(f"📘 Detected 10X .h5 file: {f}")
    adata = sc.read_10x_h5(f)
    adata.var_names_make_unique()

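elif any(f.endswith(('.mtx', '.mtx.gz')) for f in file_names):
    print("📘 Detected 10X .mtx format")
    # read_10x_mtx looks for the standard 10x file names in this directory
    # (matrix.mtx, barcodes.tsv, and features.tsv or genes.tsv; which compressed
    # variants are accepted depends on the scanpy version). Rename GEO files that
    # carry a sample prefix, or pass that prefix via the `prefix` argument.
    adata = sc.read_10x_mtx('./', var_names='gene_symbols')
    adata.var_names_make_unique()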

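elif any(f.endswith('.csv') for f in file_names):
    f = [x for x in file_names if x.endswith('.csv')][0]
    print(f"📘 Detected CSV file: {f}")
    df = pd.read_csv(f, index_col=0)
    # Orientation cannot be inferred reliably from the file alone.
    # Set this to False if the CSV is already cells × genes.
    csv_is_genes_by_cells = True
    if csv_is_genes_by_cells:
        df = df.T  # Transpose to cells × genes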
else:
    raise ValueError("❌ Unsupported or incomplete file set.")

### 🧬 Step 4: Wrap CSV in AnnData (if used)

In [None]:
if adata is None and 'df' in locals():
    adata = sc.AnnData(X=df.values)
    adata.var_names = df.columns
    adata.obs_names = df.index
    print("✅ AnnData object created from CSV.")
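
Because a CSV can arrive as either genes × cells or cells × genes, it can help to check the orientation before wrapping the table in AnnData. The sketch below guesses the orientation by looking for a few widely expressed human gene symbols among the row and column labels. The marker list and the function name `guess_orientation` are illustrative assumptions, and the heuristic only applies to tables labeled with human gene symbols.

```python
# Widely expressed human genes used as orientation markers (an illustrative choice).
MARKER_GENES = {"ACTB", "GAPDH", "B2M", "MALAT1", "RPLP0", "PTPRC"}


def guess_orientation(row_labels, col_labels, markers=MARKER_GENES):
    """Return 'genes_x_cells', 'cells_x_genes', or None when undecidable."""
    row_hits = len(markers & {str(label).upper() for label in row_labels})
    col_hits = len(markers & {str(label).upper() for label in col_labels})
    if row_hits > col_hits:
        return "genes_x_cells"
    if col_hits > row_hits:
        return "cells_x_genes"
    return None  # no marker found on either axis, or a tie
```

With a table read by `pd.read_csv(f, index_col=0)`, the call would be `guess_orientation(df.index, df.columns)`. A `None` result means the orientation should be checked by hand.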

### 💾 Step 5: Convert to CopyKAT-Compatible Format

In [None]:
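from scipy import sparse

# adata.X is sparse for 10x inputs but dense for CSV inputs, so densify only when needed
X = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)
# CopyKAT expects genes as rows and cells as columns
df_copykat = pd.DataFrame(
    X.T,
    index=adata.var_names,
    columns=adata.obs_names
)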
df_copykat.to_csv("adata_for_copykat.csv")
print("📁 Exported: adata_for_copykat.csv")
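
CopyKAT receives this matrix through its `rawmat` argument, which suggests unnormalized counts. Assuming that is the intended input, a quick check that the exported values are non-negative integers can catch a normalized or log-transformed matrix early. The function name `looks_like_raw_counts` and the sampling threshold are hypothetical choices for this sketch.

```python
import numpy as np


def looks_like_raw_counts(values, sample_size=100_000):
    """Return True if the (sampled) values are finite, non-negative integers."""
    flat = np.asarray(values, dtype=float).ravel()
    if flat.size == 0:
        return False
    if flat.size > sample_size:
        # An evenly spaced sample keeps the check cheap on large matrices.
        idx = np.linspace(0, flat.size - 1, sample_size).astype(int)
        flat = flat[idx]
    return bool(
        np.all(np.isfinite(flat))
        and np.all(flat >= 0)
        and np.all(flat == np.round(flat))
    )
```

In this notebook the check could be run as `looks_like_raw_counts(df_copykat.values)`. For a sparse `adata.X`, passing `adata.X.data` checks the stored non-zero entries without densifying.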

### 🧪 Step 6: Set Up R Environment and Install CopyKAT

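In [None]:
# Load the rpy2 extension so that %%R cells are available
%load_ext rpy2.ipython

In [None]:
%%capture
%%R
# CopyKAT is distributed through GitHub (navinlabcode/copykat) rather than Bioconductor
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("navinlabcode/copykat")
if (!requireNamespace("data.table", quietly = TRUE)) install.packages("data.table")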

### 🧬 Step 7: Load Expression Matrix and Run CopyKAT

In [None]:
%%R
library(copykat)
library(data.table)

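# Load the gene × cell expression CSV (the first column holds the gene names)
expr <- fread("adata_for_copykat.csv", data.table = FALSE)
rownames(expr) <- expr[[1]]
expr[[1]] <- NULL
expr <- as.matrix(expr)  # pass a numeric matrix, as in CopyKAT's usage examples

# Run CopyKAT; id.type = "S" means row names are gene symbols ("E" is for Ensembl IDs)
result <- copykat(rawmat = expr, id.type = "S", sam.name = "Sample1")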

# Save prediction
write.csv(result$prediction, "copykat_prediction.csv")

### 💾 Step 8: Download the CopyKAT Prediction Output

In [None]:
from google.colab import files
files.download("copykat_prediction.csv")
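
Once the prediction file is available, a per-label tally gives a quick sanity check of the run. The sketch below uses only the standard library. It assumes the CSV has a column named `copykat.pred` holding labels such as `aneuploid` and `diploid`, which matches CopyKAT's documented prediction table but should be confirmed against the actual file. The function name `summarize_predictions` is hypothetical.

```python
import csv
from collections import Counter


def summarize_predictions(csv_path, label_column="copykat.pred"):
    """Count cells per predicted label in a CopyKAT prediction CSV."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the assumed column name does not match the file.
        if label_column not in (reader.fieldnames or []):
            raise KeyError(
                f"Column {label_column!r} not found in {reader.fieldnames}"
            )
        return Counter(row[label_column] for row in reader)
```

Calling `summarize_predictions("copykat_prediction.csv")` returns a `Counter` keyed by label. A result with almost no diploid cells may indicate that the dataset lacked a normal-cell baseline.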