# **Introduction**

This notebook processes single-cell RNA sequencing data from the GSE300475 dataset to prepare it for downstream multimodal analysis. It covers downloading and extracting the raw data files, loading and annotating the gene expression matrix with patient and response metadata, and performing normalization, log-transformation, and selection of highly variable genes to reduce noise and focus on informative features. Key preprocessing steps are visualized, including gene variability and dimensionality reduction using PCA, both before and after filtering. Finally, the processed gene expression data is exported along with cell-level metadata to support integration with other data modalities and enable machine learning modeling of treatment response in breast cancer.

**Step 1**: *Reset Google Drive Mount in Colab*

* This step ensures that any previous Google Drive mount is safely removed before starting fresh.
* First, the code tries to unmount Google Drive using `drive.flush_and_unmount()`. Any exception raised by the call is caught and printed; when Drive is simply not mounted, the call itself reports that there is nothing to unmount.
* Then it checks if the `/content/drive` folder still exists, and removes it using `shutil.rmtree()`. This clears any leftover mount point data.
* This is helpful when switching accounts, resolving permission errors, or restarting workflows cleanly.
* After this step, you're ready to freshly mount Google Drive again.

In [None]:
import shutil
import os
import scanpy as sc  # needs scanpy installed in this runtime (see Step 4)
from google.colab import drive
import tarfile

# Try unmount if already mounted
try:
    drive.flush_and_unmount()
except Exception as e:
    print("Unmount failed or not mounted yet:", e)

# Now remove the folder manually if it still exists
if os.path.exists("/content/drive"):
    shutil.rmtree("/content/drive", ignore_errors=True)

print("Cleaned /content/drive. Now you can mount again.")

Drive not mounted, so nothing to flush and unmount.
Cleaned /content/drive. Now you can mount again.


**Step 2**: *Mount Google Drive*

* After cleaning up any previous mount (Step 1), we now freshly mount Google Drive into the Colab environment.
* The method `drive.mount('/content/drive')` prompts you to authenticate using your Google account.
* Once authenticated, it creates a virtual mount point at `/content/drive` where all your Drive files can be accessed just like a local directory.
* This is essential for reading datasets, saving outputs, or loading pre-existing files from Google Drive.

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


**Step 3**: *Download and Extract TCR Data Archive from GEO*

* This step handles the download and extraction of the **TCR-seq dataset (GSE300475)** from the NCBI GEO repository.
* We **force re-mount** Google Drive (`force_remount=True`) to ensure a clean mount in case of residual connections or folder conflicts.
* The `.tar` archive is then downloaded to Google Drive at `/content/drive/MyDrive/Data/rawdata/GSE300475_RAW.tar`.
* We use Python's `tarfile` module to extract the contents into the same folder, `/content/drive/MyDrive/Data/rawdata`, making all files accessible for downstream processing.

**File downloaded**:
`GSE300475_RAW.tar` (566 MB)

**Extraction folder**:
`/content/drive/MyDrive/Data/rawdata`

**Expected Output Summary**:

* `Mounted at /content/drive`
* Confirmation of successful download via `wget`
* `Extraction complete.` after untarring the contents

In [None]:
# Mount Google Drive (force remount to avoid folder conflict)
drive.mount('/content/drive', force_remount=True)

# Change this path to where you want to store the archive
download_path = '/content/drive/MyDrive/Data/rawdata/GSE300475_RAW.tar'

# Make sure the target folder exists; wget -O fails if it is missing
os.makedirs(os.path.dirname(download_path), exist_ok=True)

# Download the TAR archive
!wget -O "$download_path" "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE300nnn/GSE300475/suppl/GSE300475_RAW.tar"

# Extract it into the same folder as the archive
extract_path = '/content/drive/MyDrive/Data/rawdata'
with tarfile.open(download_path, 'r') as tar:
    tar.extractall(path=extract_path)

print("Extraction complete.")

Mounted at /content/drive
--2025-10-04 18:33:38--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE300nnn/GSE300475/suppl/GSE300475_RAW.tar
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.12, 130.14.250.13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 592977920 (566M) [application/x-tar]
Saving to: ‘/content/drive/MyDrive/Data/rawdata/GSE300475_RAW.tar’


2025-10-04 18:33:59 (27.7 MB/s) - ‘/content/drive/MyDrive/Data/rawdata/GSE300475_RAW.tar’ saved [592977920/592977920]



Extraction complete.
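
The extraction step can also be written as a small helper that lists the archive members and opts into the standard library's `"data"` extraction filter when the running Python provides it. This is a minimal sketch, not the notebook's own code: `extract_tar` is a hypothetical helper name, and the demo runs on a throwaway archive built in a temporary folder rather than on `GSE300475_RAW.tar`.

```python
import os
import tarfile
import tempfile


def extract_tar(archive_path, dest_dir):
    """Extract archive_path into dest_dir and return the sorted member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        names = sorted(tar.getnames())
        # Extraction filters exist only in newer Python releases, so check first.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(path=dest_dir, filter="data")
        else:
            tar.extractall(path=dest_dir)
    return names


# Demo on a tiny archive created on the fly (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_barcodes.tsv")
    with open(src, "w") as fh:
        fh.write("AAACCTGAGAAACCAT-1\n")

    archive = os.path.join(tmp, "demo.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname="demo_barcodes.tsv")

    out_dir = os.path.join(tmp, "out")
    extracted = extract_tar(archive, out_dir)
    extracted_exists = os.path.exists(os.path.join(out_dir, "demo_barcodes.tsv"))

print(extracted, extracted_exists)
```

Returning the member names makes it easy to confirm that the expected barcode, feature, and matrix files are present before loading them.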


**Step 4**: *Install Scanpy Library*

* This step installs the `scanpy` package, which is a comprehensive Python library for analyzing single-cell RNA-seq data.

* It provides tools for data preprocessing, dimensionality reduction, clustering, visualization, and differential expression analysis.

* Installation is done with `pip` and lasts only for the current Colab runtime; rerun it after the runtime is reset.

* Installing `scanpy` also pulls in key dependencies such as `anndata`, `numpy`, `pandas`, `scikit-learn`, `matplotlib`, and `scipy`.
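
In Colab the installation is usually a single `!pip install scanpy` cell. As a hedged alternative in plain Python, the sketch below installs a package only when it cannot be imported. `is_installed` and `ensure_installed` are hypothetical helper names introduced for illustration, and the demo call uses the standard-library module `json` so that nothing is actually installed.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if module_name can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, pip_name=None):
    """Install the package with pip only if it is missing.

    Returns True if an install was attempted, False if the module was already there.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True


# In the notebook this would be: ensure_installed("scanpy")
# Here a standard-library module is used, so no install is triggered.
already_there = ensure_installed("json")
print(already_there)
```

Running the check before any `import scanpy` line avoids an `ImportError` in a fresh runtime.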

**Step 5**: *Load and Combine Single-Cell Gene Expression Data from Multiple Samples*

* The extracted directory containing 10X Genomics formatted data is scanned to identify all samples by detecting barcode files (`barcodes.tsv.gz`).

* Sample prefixes are automatically extracted from filenames for batch loading.

* Each sample is loaded individually using Scanpy’s `read_10x_mtx` function with gene IDs as variable names.

* A new observation column `sample_id` is added to each AnnData object to keep track of the sample origin.

* All sample AnnData objects are concatenated into a single combined AnnData object for unified downstream analysis.

* A warning about duplicate observation (cell) names appears because different samples may have overlapping barcode IDs; this can be resolved later if needed by calling `.obs_names_make_unique()`.

* The combined dataset contains 100,067 cells and 36,601 genes.

* The final combined AnnData object is saved to Google Drive in `.h5ad` format for persistent storage and future use.

* Input path: `/content/drive/MyDrive/Data/rawdata`

* Output file: `/content/drive/MyDrive/Data/rawdata/gene_expression_combined_raw.h5ad`
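
The prefix detection described above can be sketched in isolation with a regular expression, which also guards against unrelated files in the folder. The file-name pattern `<GSM accession>_<S number>_barcodes.tsv.gz` is an assumption inferred from the sample names this notebook prints, and `detect_sample_prefixes` is a hypothetical helper name.

```python
import re

# Assumed naming pattern: GSM<digits>_S<digits>_barcodes.tsv.gz
BARCODE_FILE = re.compile(r"^(GSM\d+_S\d+)_barcodes\.tsv\.gz$")


def detect_sample_prefixes(filenames):
    """Return the sorted sample prefixes of all barcode files in filenames."""
    prefixes = []
    for name in filenames:
        match = BARCODE_FILE.match(name)
        if match:
            prefixes.append(match.group(1))
    return sorted(prefixes)


# Illustrative directory listing, not read from disk.
listing = [
    "GSM9061665_S1_barcodes.tsv.gz",
    "GSM9061665_S1_features.tsv.gz",
    "GSM9061665_S1_matrix.mtx.gz",
    "GSM9061674_S10_barcodes.tsv.gz",
    "GSE300475_RAW.tar",
]
prefixes = detect_sample_prefixes(listing)
print(prefixes)
```

In the notebook, `os.listdir(extract_path)` would take the place of the hard-coded `listing`.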

In [None]:
# Path to the extracted directory
extract_path = '/content/drive/MyDrive/Data/rawdata'

# List all barcodes files to detect the sample prefixes automatically
sample_prefixes = sorted([
    f.split('_')[0] + "_" + f.split('_')[1]
    for f in os.listdir(extract_path)
    if f.endswith('barcodes.tsv.gz')
])

# Load each sample individually
adatas = []

for sample in sample_prefixes:
    print(f"Loading: {sample}")
    adata = sc.read_10x_mtx(
        extract_path,
        var_names='gene_ids',
        prefix=sample + "_",
        cache=True
    )
    adata.obs['sample_id'] = sample
    adatas.append(adata)

# Concatenate all AnnData objects
print("Concatenating samples...")
combined_adata = sc.concat(adatas, label='sample_id', keys=sample_prefixes)

print(f"Combined shape: {combined_adata.shape}")
print("Saving combined AnnData to disk...")

# Save the combined AnnData to Google Drive
save_path = '/content/drive/MyDrive/Data/rawdata/gene_expression_combined_raw.h5ad'
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # create the folder if needed
combined_adata.write(save_path)

print(f"Saved to: {save_path}")

Loading: GSM9061665_S1
Loading: GSM9061666_S2
Loading: GSM9061667_S3
Loading: GSM9061668_S4
Loading: GSM9061669_S5
Loading: GSM9061670_S6
Loading: GSM9061671_S7
Loading: GSM9061672_S8
Loading: GSM9061673_S9
Loading: GSM9061674_S10
Loading: GSM9061675_S11
Concatenating samples...


  utils.warn_names_duplicates("obs")


Combined shape: (100067, 36601)
Saving combined AnnData to disk...
Saved to: /content/drive/MyDrive/Data/rawdata/gene_expression_combined_raw.h5ad
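
The duplicate-name warning arises because the same barcode string can occur in more than one sample. The sketch below illustrates the issue in plain Python and shows one way to resolve it by appending the sample key to each barcode, which keeps the sample of origin visible in the cell name. `make_unique_barcodes` is a hypothetical helper and the barcodes are toy values, not taken from the dataset. In anndata, passing `index_unique="-"` to `concat` together with `keys` has a comparable effect.

```python
from collections import Counter


def make_unique_barcodes(barcodes_by_sample, sep="-"):
    """Append the sample key to every barcode so names are unique across samples."""
    unique = []
    for sample, barcodes in barcodes_by_sample.items():
        unique.extend(f"{bc}{sep}{sample}" for bc in barcodes)
    return unique


# Toy barcodes for two samples; one barcode is shared between them.
toy = {
    "GSM9061665_S1": ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"],
    "GSM9061666_S2": ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACGAG-1"],
}

# Barcodes that would collide after a plain concatenation.
raw = [bc for barcodes in toy.values() for bc in barcodes]
duplicates = [bc for bc, n in Counter(raw).items() if n > 1]

unique_names = make_unique_barcodes(toy)
print(duplicates)
print(unique_names)
```

Unique cell names matter later when gene expression is joined with other per-cell tables by barcode.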


**Step 6**: *Annotate Cells with Patient ID and Treatment Response Labels*

* The previously saved combined AnnData object is loaded from disk.

* A mapping dictionary links the original sample IDs (`sample_id`) to patient IDs (`patient_id`), standardizing sample names to meaningful patient codes (e.g., `"GSM9061665_S1"` → `"PT1"`).

* Known responder and non-responder patient sets are defined based on clinical metadata.

* Each cell is annotated with its corresponding patient ID using the mapping.

* A new categorical column `response` is added, classifying cells as `"Responder"`, `"Non-responder"`, or `"Unknown"` depending on patient membership in responder/non-responder groups.

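* Cells with unknown response status are filtered out to focus the analysis on well-defined response groups. With the mapping used here, this drops the samples labeled `"Week1"`, `"Week3"`, `"Week3_addition"`, and `"PT15_add"`, because none of these labels appears in the responder or non-responder sets.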

* The filtered and annotated AnnData object is saved back to Google Drive.

* The final dataset contains 58,177 cells and 36,601 genes, with 19,201 responder cells and 38,976 non-responder cells.

* Input file: `/content/drive/MyDrive/Data/rawdata/gene_expression_combined_raw.h5ad`

* Output file: `/content/drive/MyDrive/Data/rawdata/gene_expression_annotated.h5ad`

In [None]:
import pandas as pd

# Load your saved raw data
adata = sc.read_h5ad("/content/drive/MyDrive/Data/rawdata/gene_expression_combined_raw.h5ad")

# Mapping from GSM ID to patient ID
gsm_to_patient = {
    "GSM9061665_S1": "PT1",
    "GSM9061666_S2": "PT6",
    "GSM9061667_S3": "PT7",
    "GSM9061668_S4": "PT13",
    "GSM9061669_S5": "PT15",
    "GSM9061670_S6": "Week3",
    "GSM9061671_S7": "Week3_addition",
    "GSM9061672_S8": "PT15_add",
    "GSM9061673_S9": "PT11",
    "GSM9061674_S10": "PT5",
    "GSM9061675_S11": "Week1"
}

# Known responders/non-responders
responder_pts = {"PT1", "PT7", "PT15"}
non_responder_pts = {"PT5", "PT6", "PT11", "PT13"}

# Add patient_id to obs
adata.obs['patient_id'] = adata.obs['sample_id'].map(gsm_to_patient)

# Add response label
def classify_response(pid):
    if pid in responder_pts:
        return "Responder"
    elif pid in non_responder_pts:
        return "Non-responder"
    else:
        return "Unknown"

adata.obs['response'] = adata.obs['patient_id'].map(classify_response)

# Remove cells with unknown response
adata = adata[adata.obs['response'] != "Unknown"].copy()

# Save the annotated file
annotated_path = "/content/drive/MyDrive/Data/rawdata/gene_expression_annotated.h5ad"
adata.write(annotated_path)
print(f"Annotated AnnData saved to:\n{annotated_path}")
print(f"Final shape: {adata.shape}")
adata.obs['response'].value_counts()

  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Annotated AnnData saved to:
/content/drive/MyDrive/Data/rawdata/gene_expression_annotated.h5ad
Final shape: (58177, 36601)


| response | count |
|---|---|
| Non-responder | 38976 |
| Responder | 19201 |
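
A quick way to see which samples the filter removes is to apply the same classification rule to the mapping on its own. The sketch below copies the mapping and patient sets from the cell above; `dropped_samples` is a name introduced here for illustration.

```python
# Mapping and response sets copied from the annotation cell.
gsm_to_patient = {
    "GSM9061665_S1": "PT1",
    "GSM9061666_S2": "PT6",
    "GSM9061667_S3": "PT7",
    "GSM9061668_S4": "PT13",
    "GSM9061669_S5": "PT15",
    "GSM9061670_S6": "Week3",
    "GSM9061671_S7": "Week3_addition",
    "GSM9061672_S8": "PT15_add",
    "GSM9061673_S9": "PT11",
    "GSM9061674_S10": "PT5",
    "GSM9061675_S11": "Week1",
}
responder_pts = {"PT1", "PT7", "PT15"}
non_responder_pts = {"PT5", "PT6", "PT11", "PT13"}

# Samples whose label is in neither set are classified "Unknown" and dropped.
known = responder_pts | non_responder_pts
dropped_samples = sorted(
    gsm for gsm, pid in gsm_to_patient.items() if pid not in known
)
print(dropped_samples)
```

Note that `"PT15_add"` is dropped because the label does not match `"PT15"` exactly. Whether that sample should instead be assigned to patient PT15 depends on the dataset's metadata and is not something this check can settle.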
