# Notebook to pack the Zheng68k data from 10x Genomics into an AnnData (h5ad) file



This notebook assumes that it is run from the `bmfm-mammal-release/mammal/examples/scrna_cell_type` directory.

In [None]:
# Check the current directory. Note that `biomed-multi-alignment` will probably be in a different location on your system.
!pwd

/Users/matann/git/biomed-multi-alignment/mammal/examples/scrna_cell_type/data



## Obtaining the raw data:
The main data is available online, for example on the [10xgenomics](https://www.10xgenomics.com/) site. The labels are based on the data on [this dataset page](https://www.10xgenomics.com/datasets/fresh-68-k-pbm-cs-donor-a-1-standard-1-1-0).

From that page, download the file `fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz` and place it in this directory.

In [None]:
!ls -sh

total 243080
     8 README.md
    16 clear_data_prep.ipynb
243056 fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz


Extract the archive. You should now have a directory called `filtered_matrices_mex/hg19`.

In [None]:
!tar -xzvf fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz 

x filtered_matrices_mex/
x filtered_matrices_mex/hg19/
x filtered_matrices_mex/hg19/barcodes.tsv
x filtered_matrices_mex/hg19/genes.tsv
x filtered_matrices_mex/hg19/matrix.mtx
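
If the `tar` command is not available on your system, the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` is hypothetical and not part of the original notebook; this is a minimal sketch that assumes the archive sits in the current directory.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


archive_name = "fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz"
if Path(archive_name).exists():  # only runs when the download is present
    for name in extract_archive(archive_name):
        print(name)
```

The guard around the call keeps the cell harmless when the archive has not been downloaded yet.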


In [None]:
!ls -shR --color=never

total 243080
     8 README.md
    16 clear_data_prep.ipynb
     0 filtered_matrices_mex
243056 fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz

./filtered_matrices_mex:
total 0
0 hg19

./filtered_matrices_mex/hg19:
total 993144
  2280 barcodes.tsv	  1600 genes.tsv	989264 matrix.mtx



In [None]:
import anndata 
# from collections import Counter
import pandas as pd
# import matplotlib.pyplot as plt
import math
import numpy as np
# from scipy.sparse import csr_matrix
from scipy.io import mmread
import scanpy as sc

ImportError: Numba needs NumPy 2.1 or less. Got NumPy 2.2.

## Reading the scRNA matrix from a file:


In [None]:
mmx = mmread("data/filtered_matrices_mex/hg19/matrix.mtx")


In [None]:
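# Create an AnnData object around the data that was read.
# The matrix file stores genes as rows and cells as columns, so it is transposed
# to the cells x genes layout that AnnData expects and converted to CSR format.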
anndata_object = anndata.AnnData(X=mmx.transpose().tocsr())
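
To make the effect of the transpose concrete, here is a toy illustration. It uses a small dense NumPy array as a stand-in for the sparse matrix read by `mmread`; the numbers are made up for the example.

```python
import numpy as np

# Toy stand-in for matrix.mtx: 3 genes (rows) x 2 cells (columns).
genes_by_cells = np.array([
    [5, 0],
    [0, 2],
    [1, 7],
])

# AnnData stores observations (cells) as rows and variables (genes) as columns.
cells_by_genes = genes_by_cells.T

print(genes_by_cells.shape)  # (3, 2)
print(cells_by_genes.shape)  # (2, 3)
print(cells_by_genes[1])     # expression profile of the second cell: [0 2 7]
```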

In [None]:
# Cell identifiers (barcodes)
barcodes = pd.read_csv("data/filtered_matrices_mex/hg19/barcodes.tsv",header=None,sep="\t")
# names of genes
genes = pd.read_csv("data/filtered_matrices_mex/hg19/genes.tsv",header=None,sep="\t")
# cell types
cell_type_lables = pd.read_csv("data/zheng17_bulk_lables.txt",header=None)

In [None]:

# use the gene names as variable names in the AnnData object
anndata_object.var_names=genes[1]

# use the cell barcodes as names for the samples
anndata_object.obs_names=barcodes[0]

# use cell types as labels for the samples
anndata_object.obs['celltype']=cell_type_lables.squeeze().to_numpy()
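
The assignments above only make sense if the barcode, gene, and label tables line up with the matrix dimensions, and if the label file lists one label per barcode in the same order (an assumption the notebook relies on implicitly). A minimal sketch of a length check follows; `check_alignment` is a hypothetical helper, not part of the notebook or of AnnData.

```python
def check_alignment(n_cells, n_genes, barcodes, genes, labels):
    """Raise ValueError if the annotation lengths disagree with the matrix shape."""
    if len(barcodes) != n_cells:
        raise ValueError(f"{len(barcodes)} barcodes for {n_cells} cells")
    if len(genes) != n_genes:
        raise ValueError(f"{len(genes)} gene names for {n_genes} genes")
    if len(labels) != n_cells:
        raise ValueError(f"{len(labels)} labels for {n_cells} cells")
    return True


# Toy data: 2 cells, 3 genes.
print(check_alignment(2, 3, ["AAAC-1", "AAAG-1"], ["g1", "g2", "g3"], ["T", "B"]))

# Hypothetical usage with the notebook's objects:
# check_alignment(*anndata_object.shape, barcodes, genes, cell_type_lables)
```

Note that this checks lengths only; it cannot detect labels that are present in the wrong order.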

In [None]:
# Save the unprocessed AnnData object to disk
anndata_object.write_h5ad("Zhang_68k_processed.h5ad")

In [None]:
# Process the data: filter out cells with shallow reads, normalize depth, and change to a log scale of about 0-10 (log_2(1001) ~= 10)

sc.pp.filter_cells(anndata_object, min_genes=200)
sc.pp.normalize_total(anndata_object, target_sum=1000.0)
sc.pp.log1p(anndata_object, base=2)


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")
  sc.pp.normalize_total(anndata_object,1000.)
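
As a worked example of the arithmetic behind the normalization and log steps, the sketch below reproduces it on a toy dense matrix with plain NumPy. It is meant to show why the values end up in roughly the 0-10 range, not to replace the scanpy calls; the counts are made up.

```python
import numpy as np

counts = np.array([
    [10.0, 0.0, 30.0],   # cell with 40 total counts
    [0.0, 500.0, 0.0],   # cell with all of its counts in a single gene
])

target_sum = 1000.0
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum   # each row now sums to 1000
log_scaled = np.log2(1.0 + normalized)      # log1p with base 2

print(normalized.sum(axis=1))
print(log_scaled.max())                     # log2(1001), just under 10
```

The upper bound of about 10 is reached only when a cell has all its counts in one gene, which is why the observed maximum in real data is lower.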


In [None]:
# Split the range of stored (nonzero) values into 10 evenly spaced bin edges, from the minimum to the maximum
bins=np.linspace(anndata_object.X.data.min(), anndata_object.X.max(),num=10)
bins

array([0.13107748, 0.95458889, 1.7781003 , 2.6016117 , 3.42512311,
       4.24863452, 5.07214593, 5.89565734, 6.71916875, 7.54268016])

In [None]:
# Replace each stored (nonzero) log-expression value with the index of its bin
anndata_object.X.data=np.digitize(anndata_object.X.data, bins)
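
The behaviour of `np.digitize` with `np.linspace` edges can be checked on a toy array. The values below are made up; the point is that with 10 edges spanning the minimum to the maximum, the minimum maps to bin 1 and only the maximum itself maps to bin 10.

```python
import numpy as np

values = np.array([0.2, 1.0, 3.5, 7.5, 0.2])

# 10 evenly spaced edges from the smallest to the largest value.
bins = np.linspace(values.min(), values.max(), num=10)

# digitize returns i such that bins[i-1] <= x < bins[i].
binned = np.digitize(values, bins)

print(binned)
print(binned.min(), binned.max())
```

Because only the stored entries of the sparse matrix are binned, implicit zeros stay at 0 and remain distinguishable from the lowest expressed bin.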

In [None]:
# Save the filtered and binned AnnData object to disk
anndata_object.write_h5ad("data/Zhang68k_filtered.h5ad")

In [None]:
def convert_to_double_sorted_geneformer_sequance(anndata_object, key):
    # Genes are sorted by expression bin (descending) and, within a bin, by gene name (ascending).
    cell = anndata_object[key]  # view of a single cell
    gene_names = anndata_object.var_names.to_numpy()[cell.X.indices]  # names of the expressed genes
    return [name for _, name in sorted(zip(-cell.X.data, gene_names))]
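
The double sort can be illustrated without AnnData. The sketch below applies the same idea to plain Python lists; `sort_genes_by_bin_then_name` is a hypothetical name, and the bins and gene names are toy values.

```python
def sort_genes_by_bin_then_name(bins, gene_names):
    """Order gene names by bin (descending), breaking ties by name (ascending)."""
    # Negating the bin makes the default ascending sort put high bins first.
    pairs = sorted(zip((-b for b in bins), gene_names))
    return [name for _, name in pairs]


bins = [3, 7, 3, 1]
names = ["GAPDH", "ACTB", "CD3E", "MALAT1"]
print(sort_genes_by_bin_then_name(bins, names))
# ['ACTB', 'CD3E', 'GAPDH', 'MALAT1']
```

Sorting tuples of `(-bin, name)` gives both orderings in a single pass, since Python compares tuples element by element.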

In [None]:
!pwd

/Users/matann/git/bmfm-mammal-release/mammal/examples/cell_type_new
