# Structure Keloids Single Cell Data

This notebook organizes the single-cell RNA-seq data collected in [Single cell transcriptomics reveals the cellular heterogeneity of keloids and the mechanism of their aggressiveness](https://www.nature.com/articles/s42003-024-07311-1#data-availability) into an `AnnData` object. The data are available from NCBI’s Gene Expression Omnibus under accession GSE243716.

# Download Data

**TODO:** Set the output paths in the cell below to locations on your own file system (i.e., replace every path that begins with `/nfs/turbo/...`).

In [1]:
!wget "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE243716&format=file" -O /nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716.tar
!mkdir -p /nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716/
!tar -xvf /nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716.tar -C /nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716/

--2025-04-10 00:14:52--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE243716&format=file
Resolving proxy1.arc-ts.umich.edu (proxy1.arc-ts.umich.edu)... 141.211.192.53
Connecting to proxy1.arc-ts.umich.edu (proxy1.arc-ts.umich.edu)|141.211.192.53|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 245596160 (234M) [application/x-tar]
Saving to: ‘/nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716.tar’


2025-04-10 00:14:55 (83.2 MB/s) - ‘/nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716.tar’ saved [245596160/245596160]

GSM7794710_K-barcodes.tsv.gz
GSM7794710_K-features.tsv.gz
GSM7794710_K-matrix.mtx.gz
GSM7794711_H-barcodes.tsv.gz
GSM7794711_H-features.tsv.gz
GSM7794711_H-matrix.mtx.gz


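After running this code, the `GSE243716/` directory should contain the six files listed in the `tar` output above: a barcodes, a features, and a matrix file for each of the two samples (`GSM7794710_K-` and `GSM7794711_H-`).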
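
One way to confirm that the extraction worked is to check for the six expected files before loading them. The sketch below is a minimal, standard-library-only check. The helper name `missing_10x_files` is hypothetical, and the file names are taken from the `tar` output above.

```python
import os

# Sample prefixes and file suffixes as they appear in the tar output above.
PREFIXES = ("GSM7794710_K-", "GSM7794711_H-")
SUFFIXES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_10x_files(datapath, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Return the expected file names that are not present in `datapath`."""
    expected = [prefix + suffix for prefix in prefixes for suffix in suffixes]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(datapath, name))
    ]
```

Calling `missing_10x_files(...)` on your extraction directory should return an empty list once the archive has been unpacked. Any names it does return are the files that still need to be downloaded or extracted.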

# Package as AnnData and Save to File

**TODO:** Replace `DATAPATH` with a location on your computer.

In [10]:
import os
import anndata as ad
import scanpy as sc

# === File Paths ===
DATAPATH = "/nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716/"
OUTFILE = os.path.join(DATAPATH, "keloids.h5ad")

# === Load Data ===
# K sample
# `prefix` selects this sample's files; `var_names='gene_ids'` indexes genes by ID rather than symbol
adata_K = sc.read_10x_mtx(
    DATAPATH,
    var_names='gene_ids',
    prefix='GSM7794710_K-'
)

# H sample
adata_H = sc.read_10x_mtx(
    DATAPATH,
    var_names='gene_ids',
    prefix='GSM7794711_H-'
)

# Add sample identifiers
adata_K.obs['sample'] = 'K'
adata_H.obs['sample'] = 'H'

# === Merge ===
adata_combined = ad.concat([adata_K, adata_H], join='outer')

# === Save to .h5ad ===
adata_combined.write(OUTFILE)
print(f"[+] Merged AnnData saved to: {OUTFILE}")

  utils.warn_names_duplicates("obs")


[+] Merged AnnData saved to: /nfs/turbo/umms-indikar/Joshua/differentialExpression/GSE243716/keloids.h5ad
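
The `warn_names_duplicates("obs")` message most likely means that some cell barcodes occur in both samples, so the observation names of the merged object are not unique. This can be checked directly from the two `barcodes.tsv.gz` files. The following is a minimal, standard-library-only sketch; the helper names `read_barcodes` and `shared_barcodes` are hypothetical.

```python
import gzip


def read_barcodes(path):
    """Read one barcode per line from a gzipped 10x barcodes file."""
    with gzip.open(path, "rt") as handle:
        return [line.strip() for line in handle if line.strip()]


def shared_barcodes(path_a, path_b):
    """Return the sorted barcodes that appear in both files."""
    return sorted(set(read_barcodes(path_a)) & set(read_barcodes(path_b)))
```

If the overlap is non-empty, one option is to pass `keys=['K', 'H']` together with `index_unique='-'` to `ad.concat`, which appends the sample key to each barcode so that the merged observation names become unique.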


# Validation

In [11]:
adata_check = sc.read_h5ad(OUTFILE)

print("[+] Reload successful!")
print("Shape (in memory):", adata_combined.shape)
print("Shape (reloaded):", adata_check.shape)
print("Sample counts:\n", adata_check.obs['sample'].value_counts())


[+] Reload successful!
Shape (in memory): (25855, 36601)
Shape (reloaded): (25855, 36601)
Sample counts:
 sample
H    16385
K     9470
Name: count, dtype: int64


  utils.warn_names_duplicates("obs")
