#### Single-Cell RNA-seq Data Merging Pipeline

**Purpose**:  
Merge multiple single-cell RNA-seq count files (CSV format) into a consolidated dataset with metadata.

**Key Steps**:
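1. **Input**: 14 raw count files named `GSM<accession>_<sample label>_raw_counts.csv` (e.g. `GSM5226574_C51ctr_raw_counts.csv`)
2. **Metadata Extraction**:  
   - Sample ID (the filename without the `_raw_counts.csv` suffix)  
   - Condition (`Control` if the sample ID contains `ctr`, `COVID` if it contains `cov`)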
3. **Output**:  
   - Merged CSV (`merged_counts.csv`) with structure:  
     - Rows = Cells  
     - Columns = Genes + `sample`/`condition` metadata

**Requirements**:  
`pandas` (third-party); `pathlib` and `re` (Python standard library)
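
The metadata extraction step can be sketched as a small standalone function. This is a minimal sketch, not the notebook's own code: `parse_sample_filename` and `FILENAME_PATTERN` are hypothetical names, and the pattern assumes filenames shaped like `GSM5226574_C51ctr_raw_counts.csv`.

```python
import re

# Hypothetical helper; the pattern assumes names like
# "GSM5226574_C51ctr_raw_counts.csv".
FILENAME_PATTERN = re.compile(r"(GSM\d+_.+)_raw_counts\.csv", re.IGNORECASE)


def parse_sample_filename(filename):
    """Return (sample_id, condition), or None if the name is not recognized."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    sample_id = match.group(1)
    lowered = sample_id.lower()
    # Condition is inferred from a substring of the sample ID.
    if "ctr" in lowered:
        condition = "Control"
    elif "cov" in lowered:
        condition = "COVID"
    else:
        return None
    return sample_id, condition


print(parse_sample_filename("GSM5226574_C51ctr_raw_counts.csv"))
print(parse_sample_filename("GSM5226581_L01cov_raw_counts.csv"))
```

Keeping the parsing in one function makes it easy to check the naming rules against a few filenames before any large CSV is loaded.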

In [8]:
# Import packages
import re  # For parsing sample IDs out of file names
from pathlib import Path  # For cross-platform file paths and globbing

import pandas as pd  # For loading and manipulating tables

In [9]:
ls

 Volume in drive E is Volume
 Volume Serial Number is A45F-3E36

 Directory of E:\COVID_LUNGS\notebooks

26/07/2025  15:38    <DIR>          .
26/07/2025  15:38    <DIR>          ..
26/07/2025  15:04    <DIR>          .ipynb_checkpoints
26/07/2025  15:38            14,519 merge_counts.ipynb
               1 File(s)         14,519 bytes
               3 Dir(s)  716,024,213,504 bytes free


In [10]:
# Step 1: Define the data directory (where the 14 raw CSV files are)
DATA_DIR = Path("../data")  # Adjust path if needed
OUTPUT_FILE = DATA_DIR / "merged_counts.csv"

In [12]:
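# Step 2: List the raw count files in the data folder
# Matching on the "_raw_counts.csv" suffix keeps a previously written
# merged_counts.csv out of the list when the notebook is re-run.
files = sorted(DATA_DIR.glob("*_raw_counts.csv"))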

print(f"Found {len(files)} CSV files.")

if not files:
    raise FileNotFoundError(f"No CSV files found in '{DATA_DIR}'.")

# Create a list to hold all sample DataFrames
all_samples_dfs = []

Found 14 CSV files.


In [13]:
# Step 3: Process each file — extract metadata and load data
for f in files:
    filename = f.name

    # Step 3.1: Extract sample ID from filename
    # fullmatch requires the whole name to fit the pattern, not just a prefix
    match = re.fullmatch(r"(GSM\d+_.+)_raw_counts\.csv", filename, re.IGNORECASE)
    if not match:
        print(f"Skipping '{filename}' — filename format not recognized.")
        continue

    sample_id = match.group(1)

    # Step 3.2: Determine condition from filename
    if "ctr" in sample_id.lower():
        condition = "Control"
    elif "cov" in sample_id.lower():
        condition = "COVID"
    else:
        print(f"Skipping '{filename}' — no condition found.")
        continue

    print(f"Processing: {filename} | Sample: {sample_id} | Condition: {condition}")

    try:
        # Step 3.3: Read the CSV file and transpose it to cells × genes
        df = pd.read_csv(f, index_col=0).T
        df["sample"] = sample_id
        df["condition"] = condition
        all_samples_dfs.append(df)
    except Exception as e:
        print(f"Error reading '{filename}': {e}")

Processing: GSM5226574_C51ctr_raw_counts.csv | Sample: GSM5226574_C51ctr | Condition: Control
Processing: GSM5226575_C52ctr_raw_counts.csv | Sample: GSM5226575_C52ctr | Condition: Control
Processing: GSM5226576_C53ctr_raw_counts.csv | Sample: GSM5226576_C53ctr | Condition: Control
Processing: GSM5226577_C54ctr_raw_counts.csv | Sample: GSM5226577_C54ctr | Condition: Control
Processing: GSM5226578_C55ctr_raw_counts.csv | Sample: GSM5226578_C55ctr | Condition: Control
Processing: GSM5226579_C56ctr_raw_counts.csv | Sample: GSM5226579_C56ctr | Condition: Control
Processing: GSM5226580_C57ctr_raw_counts.csv | Sample: GSM5226580_C57ctr | Condition: Control
Processing: GSM5226581_L01cov_raw_counts.csv | Sample: GSM5226581_L01cov | Condition: COVID
Processing: GSM5226582_L03cov_raw_counts.csv | Sample: GSM5226582_L03cov | Condition: COVID
Processing: GSM5226583_L04cov_raw_counts.csv | Sample: GSM5226583_L04cov | Condition: COVID
Processing: GSM5226584_L04covaddon_raw_counts.csv | Sample: GSM522
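
After the transpose in Step 3.3, the cell barcodes become the row index. If the same barcode string occurs in more than one sample (an assumption, not checked against these files), the merged index will contain duplicates. One possible guard, sketched with made-up barcodes and a hypothetical helper `make_barcodes_unique`, is to prefix each barcode with its sample ID before appending:

```python
import pandas as pd


def make_barcodes_unique(df, sample_id):
    """Return a copy of df whose index is prefixed with the sample ID."""
    out = df.copy()
    out.index = [f"{sample_id}_{barcode}" for barcode in out.index]
    return out


# Toy cells x genes tables; barcodes and counts are made up for illustration.
a = pd.DataFrame({"GeneA": [1, 0]}, index=["AAAC", "AAAG"])
b = pd.DataFrame({"GeneA": [3, 2]}, index=["AAAC", "AAAT"])

merged = pd.concat([
    make_barcodes_unique(a, "S1"),
    make_barcodes_unique(b, "S2"),
])
print(merged.index.is_unique)
```

In the loop above, this would amount to calling the helper on `df` just before `all_samples_dfs.append(df)`.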

In [14]:
# Step 4: Merge all individual DataFrames
if not all_samples_dfs:
    raise RuntimeError("No valid dataframes to merge.")

# Merge into one large DataFrame
merged_df = pd.concat(all_samples_dfs)

# Step 4.1: Move metadata columns to the front
meta_cols = ["sample", "condition"]
gene_cols = [col for col in merged_df.columns if col not in meta_cols]
merged_df = merged_df[meta_cols + gene_cols]

# Step 4.2: Save to CSV
merged_df.to_csv(OUTPUT_FILE)
print(f"Merged dataset saved to: {OUTPUT_FILE}")
print(f"Shape: {merged_df.shape} — Rows = cells, Columns = genes + metadata")

Merged dataset saved to: ..\data\merged_counts.csv
Shape: (68710, 34548) — Rows = cells, Columns = genes + metadata
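
`pd.concat` aligns columns by name and, by default, keeps the union of all columns. A gene that is absent from one sample's file would therefore show up as `NaN` for that sample's cells. Whether this occurs here depends on the gene lists in the 14 files, which the notebook does not show. The sketch below uses toy data and assumes that a missing gene means zero counts; that assumption should be checked before applying it to real data.

```python
import pandas as pd

# Toy per-sample tables (cells x genes plus metadata); values are made up.
s1 = pd.DataFrame({"GeneA": [1, 0], "GeneB": [2, 5]}, index=["c1", "c2"])
s1["sample"], s1["condition"] = "S1", "Control"
s2 = pd.DataFrame({"GeneA": [4], "GeneC": [7]}, index=["c3"])
s2["sample"], s2["condition"] = "S2", "COVID"

merged = pd.concat([s1, s2])  # default join="outer": union of columns

meta_cols = ["sample", "condition"]
gene_cols = [c for c in merged.columns if c not in meta_cols]

# Assumption: a gene missing from a sample's file had zero counts there.
counts = merged[gene_cols].fillna(0).astype(int)

# Reassemble with metadata first, then the integer count columns.
merged = pd.concat([merged[meta_cols], counts], axis=1)

print(merged)
print(merged.groupby("condition").size())
```

The final `groupby` is a quick sanity check on how many cells each condition contributes.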
