# Notebook to pack Zhang68k data from x10cell into an AnnData (h5ad) file.  



This notebook assumes that it is run from the `bmfm-mammal-release/mammal/examples/scrna_cell_type` directory.

In [None]:
# Check the current directory.
# Note that the `biomed-multi-alignment` repository will probably be in a different location on your system.

# This is just a fancy pwd; replace it with !pwd if it does not work in your shell.
!print -D $PWD

~/git/biomed-multi-alignment/mammal/examples/scrna_cell_type/data



## Obtaining the source data:
The main data is available online, for example on the [10x Genomics](https://www.10xgenomics.com/) site. The labels are based on the data on [this dataset page](https://www.10xgenomics.com/datasets/fresh-68-k-pbm-cs-donor-a-1-standard-1-1-0).

From that page, download the file `fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz` and place it in this directory.

In [None]:
!ls -sh

total 243088
     8 README.md
243056 fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz
    24 zhang_data_prep.ipynb


#### Unzip the file.

In [None]:
!tar -xzvf fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz 

x filtered_matrices_mex/
x filtered_matrices_mex/hg19/
x filtered_matrices_mex/hg19/barcodes.tsv
x filtered_matrices_mex/hg19/genes.tsv
x filtered_matrices_mex/hg19/matrix.mtx
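
If the `tar` command is not available, the same extraction can be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_archive` is hypothetical and not part of this notebook.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Usage (assumes the archive was downloaded into the current directory):
# extract_archive("fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz")
```

Only extract archives from sources you trust, since a tar file can contain arbitrary paths.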


#### Download the labels file from the git repository at https://github.com/scverse/scanpy_usage

In [None]:
!wget https://raw.githubusercontent.com/scverse/scanpy_usage/refs/heads/master/170503_zheng17/data/zheng17_bulk_lables.txt

--2025-02-18 12:15:10--  https://raw.githubusercontent.com/scverse/scanpy_usage/refs/heads/master/170503_zheng17/data/zheng17_bulk_lables.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1348927 (1.3M) [text/plain]
Saving to: ‘zheng17_bulk_lables.txt’


2025-02-18 12:15:11 (3.59 MB/s) - ‘zheng17_bulk_lables.txt’ saved [1348927/1348927]



You should now have a directory called `filtered_matrices_mex/hg19` and the labels file `zheng17_bulk_lables.txt`.

In [None]:
!ls -shR --color=never


total 245728
     8 README.md
     0 filtered_matrices_mex
243056 fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz
    24 zhang_data_prep.ipynb
  2640 zheng17_bulk_lables.txt

./filtered_matrices_mex:
total 0
0 hg19

./filtered_matrices_mex/hg19:
total 993144
  2280 barcodes.tsv	  1600 genes.tsv	989264 matrix.mtx


The output should be something like

```
total 245728
     8 README.md
     0 filtered_matrices_mex
243056 fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz
    24 zhang_data_prep.ipynb
  2640 zheng17_bulk_lables.txt

./filtered_matrices_mex:
total 0
0 hg19

./filtered_matrices_mex/hg19:
total 993144
  2280 barcodes.tsv	  1600 genes.tsv	989264 matrix.mtx
```

## Pack the data into an AnnData file 

In [None]:
import anndata 
# from collections import Counter
import pandas as pd
# import matplotlib.pyplot as plt
# import math
import numpy as np
# from scipy.sparse import csr_matrix
from scipy.io import mmread
import scanpy as sc

#### Read the scRNA matrix from a file


In [None]:
mmx = mmread("filtered_matrices_mex/hg19/matrix.mtx")


#### Create an AnnData object wrapping the read data

Note that the matrix file stores genes as rows and cells as columns, so this code transposes it to the cells × genes layout that AnnData expects.

In [None]:
 
anndata_object = anndata.AnnData(X=mmx.transpose().tocsr())
print(anndata_object.X.shape)

(68579, 32738)
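
The effect of the transpose can be seen on a small scale. The sketch below uses a made-up matrix of 3 genes by 2 cells (the values and the temporary file name are illustrative assumptions), writes it in Matrix Market format with genes as rows, reads it back, and transposes it to cells × genes.

```python
import os
import tempfile

import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import coo_matrix

# Toy counts: 3 genes (rows) x 2 cells (columns), the orientation of matrix.mtx.
genes_by_cells = coo_matrix(np.array([[1, 0], [0, 2], [3, 4]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_matrix.mtx")
    mmwrite(path, genes_by_cells)
    loaded = mmread(path)  # sparse matrix, still genes x cells

# Transpose to cells x genes and convert to CSR for fast row (cell) access.
cells_by_genes = loaded.transpose().tocsr()

print(loaded.shape)          # (3, 2)
print(cells_by_genes.shape)  # (2, 3)
```

In the real data the same step turns a 32738 × 68579 matrix into the (68579, 32738) shape printed above.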


In [None]:
# Cell identifiers
barcodes = pd.read_csv("filtered_matrices_mex/hg19/barcodes.tsv",header=None,sep="\t")
# names of genes
genes = pd.read_csv("filtered_matrices_mex/hg19/genes.tsv",header=None,sep="\t")


In [None]:
# cell types (this file actually has just one column)
cell_type_lables = pd.read_csv("zheng17_bulk_lables.txt",header=None,sep="\t")

In [None]:

# use the gene names as variable names in the AnnData object
anndata_object.var_names=genes[1]

# use the cell barcodes as names for the samples
anndata_object.obs_names=barcodes[0]
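
Gene symbols are not guaranteed to be unique, so it may be worth checking the second column of `genes.tsv` for duplicates before relying on it as an index. The sketch below does this with pandas on a made-up three-row table; the IDs and symbols are placeholders, not values from the real file.

```python
import io

import pandas as pd

# Made-up stand-in for genes.tsv: column 0 is a gene ID, column 1 a gene symbol.
toy_genes_tsv = "ID0001\tGENE_A\nID0002\tGENE_B\nID0003\tGENE_A\n"
genes = pd.read_csv(io.StringIO(toy_genes_tsv), header=None, sep="\t")

# Symbols that occur more than once would produce non-unique var_names.
is_repeated = genes[1].duplicated(keep=False)
duplicated_symbols = genes[1][is_repeated].unique().tolist()
print(duplicated_symbols)  # ['GENE_A']

# The ID column is unique here, so it could serve as an alternative index.
print(genes[0].is_unique)  # True
```

If duplicates show up in the real file, AnnData's `var_names_make_unique()` method can be called after assigning `var_names`; whether that is appropriate depends on how downstream code looks up genes.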



In [None]:

# use cell types as labels for the samples
anndata_object.obs['celltype']=cell_type_lables.squeeze().to_numpy()
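
The labels file carries no barcodes, so the assignment above relies on the labels being listed in the same order as the cells. The order cannot be checked from the files alone, but the lengths can. The following is a minimal sketch of such a check; `check_label_alignment` is a hypothetical helper and the toy barcodes and labels are made up.

```python
import pandas as pd


def check_label_alignment(n_cells, barcodes, labels):
    """Raise ValueError if barcodes or labels do not match the number of cells."""
    if len(barcodes) != n_cells:
        raise ValueError(f"{len(barcodes)} barcodes for {n_cells} cells")
    if len(labels) != n_cells:
        raise ValueError(f"{len(labels)} labels for {n_cells} cells")
    if pd.Series(barcodes).duplicated().any():
        raise ValueError("barcodes are not unique")


# Toy data standing in for barcodes.tsv and the labels file.
toy_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
toy_labels = ["CD14+ Monocyte", "CD19+ B", "CD56+ NK"]
check_label_alignment(3, toy_barcodes, toy_labels)  # passes silently
```

In this notebook the call would look like `check_label_alignment(anndata_object.n_obs, barcodes[0], cell_type_lables.squeeze())`.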

In [None]:
# Save result anndata object to disk
anndata_object.write_h5ad("Zhang_68k.h5ad")

## The AnnData file is now ready in the data directory.

In [None]:
# # Process the data: filter out cells with shallow reads, normalize depth,
# # and change to a log scale of about 0-10 (log_2(1001) ~= 10)

# sc.pp.filter_cells(anndata_object, min_genes=200)
# sc.pp.normalize_total(anndata_object, target_sum=1000.)
# sc.pp.log1p(anndata_object, base=2)
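
To see where the 0-10 range comes from, the arithmetic can be reproduced with plain NumPy on a made-up dense matrix. This is not scanpy's implementation, only a sketch of the same computation under the assumption that every cell is scaled to a total of 1000 counts and then mapped through log2(1 + x).

```python
import numpy as np

# Made-up counts: 2 cells (rows) x 3 genes (columns).
counts = np.array([[5.0, 0.0, 15.0],
                   [0.0, 50.0, 0.0]])

target_sum = 1000.0
# Scale every cell so that its counts add up to target_sum.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum

# log2(1 + x): a cell whose reads all fall on one gene reaches log2(1001).
log_scaled = np.log1p(normalized) / np.log(2.0)

print(normalized.sum(axis=1))  # each row sums to 1000
print(log_scaled.max())        # about 9.967, i.e. log2(1001)
```

Zero counts stay at zero, and no value can exceed log2(1001), which is just under 10.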


In [None]:
# anndata_object.X.data.min(), anndata_object.X.max()

In [None]:
# # split the range into bins, roughly 0, 1, 2, ..., 10
# bins=np.linspace(anndata_object.X.data.min(), anndata_object.X.max(),num=11)
# bins

In [None]:
# # convert the counts to bins
# anndata_object.X.data=np.digitize(anndata_object.X.data, bins)

In [None]:
# # Save result anndata object to disk
# anndata_object.write_h5ad("Zhang68k_filtered.h5ad")

In [None]:
# def convert_to_double_sorted_geneformer_sequance(anndata_object):
#     # the genes are sorted by expression bin (descending) and within the bin by the gene names.
    
#     return [a[1] for a in sorted(zip(-anndata_object.X.data,anndata_object.var_names.to_numpy()[anndata_object.X.indices]))]
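
The ordering described in the comment (expression bin descending, ties broken by gene name) can be illustrated for a single cell in plain Python. This is a sketch under the assumption that one cell's bins are available as a dictionary; `rank_genes` and the toy values are hypothetical.

```python
def rank_genes(bin_by_gene):
    """Return gene names sorted by expression bin (descending), then by name."""
    return sorted(bin_by_gene, key=lambda gene: (-bin_by_gene[gene], gene))


# Made-up expression bins for a single cell.
toy_cell = {"MALAT1": 5, "B2M": 5, "CD3E": 3, "ACTB": 1}
print(rank_genes(toy_cell))  # ['B2M', 'MALAT1', 'CD3E', 'ACTB']
```

Negating the bin in the sort key gives the descending order while keeping gene names in ascending order within a bin.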

In [None]:
# convert_to_double_sorted_geneformer_sequance(anndata_object)

In [None]:
# anndata_object.n_obs